METHODS AND COMPOSITIONS FOR SELECTING TAG NUCLEIC
ACIDS AND PROBE ARRAYS

COPYRIGHT NOTICE 
A portion of the disclosure of this patent document contains material which 
is subject to copyright protection. The copyright owner has no objection to the facsimile 
reproduction by anyone of the patent document or the patent disclosure as it appears in
the Patent and Trademark Office patent file or records, but otherwise reserves all 
copyright rights whatsoever. 

FIELD OF THE INVENTION
This invention provides sets of nucleic acid tags, arrays of oligonucleotide
probes, nucleic acid-tagged sets of recombinant cells and other compositions, and methods
of selecting oligonucleotide probe arrays. The invention relates to the selection and
interaction of nucleic acids, and nucleic acids immobilized on solid substrates, including
related chemistry, biology, and medical diagnostic uses.

BACKGROUND OF THE INVENTION

Methods of forming large arrays of oligonucleotides and other polymers on
a solid substrate are known. Pirrung et al., U.S. Patent No. 5,143,854 (see also PCT
Application No. WO 90/15070), McGall et al., U.S. Patent No. 5,412,087, Chee et al.,
SN PCT/US94/12305, and Fodor et al., PCT Publication No. WO 92/10092 describe
methods of forming arrays of oligonucleotides and other polymers using, for example,
light-directed synthesis techniques.

In the Fodor et al. publication, methods are described for using
computer-controlled systems to direct polymer array synthesis. Using the Fodor approach, one
heterogeneous array of polymers is converted, through simultaneous coupling at multiple
reaction sites, into a different heterogeneous array. See also, Fodor et al. (1991) Science,
251: 767-777; Lipshutz et al. (1995) BioTechniques 19(3): 442-447; Fodor et al. (1993)
Nature 364: 555-556; and Medlin (1995) Environmental Health Perspectives 244-246.



The arrays are typically placed on a solid surface with an area less than 1 square inch, although
much larger surfaces are optionally used.

Additional methods applicable to polymer synthesis on a substrate are
described, e.g., in U.S. Pat. No. 5,384,261, incorporated herein by reference for all
purposes. In the methods disclosed in these applications, reagents are delivered to the
substrate by flowing or spotting polymer synthesis reagents on predefined regions of the
solid substrate. In each instance, certain activated regions of the substrate are physically
separated from other regions when the monomer solutions are delivered to the various
reaction sites, e.g., by means of grooves, wells and the like.

Procedures for synthesizing polymer arrays are referred to herein as very
large scale immobilized polymer synthesis (VLSIPS™) procedures. Oligonucleotide
VLSIPS™ arrays are useful, for instance, in a variety of procedures for monitoring test
nucleic acids in a sample. In probe arrays with multiple probe sets, many distinct
hybridization interactions can be monitored simultaneously. However, unwanted
hybridization between probes, or between probes and other nucleic acids, can make
analysis of multiple hybridizations problematic. This invention solves these and other
problems.

SUMMARY OF THE INVENTION 
With this invention it is now possible to label and detect many individual
components present, inter alia, in molecular, cellular and viral libraries using a limited
number of hybridization conditions. Components are labeled with specially selected
nucleic acid tags, and the presence of individual tags is monitored by hybridization to a
probe array (typically a VLSIPS™ array of oligonucleotide probes). Thus, the tag nucleic
acids are labels for the individual components, and the probe array provides a label reader
which permits simultaneous detection of a very large number of tag nucleic acids. This
facilitates massive parallel analysis of all of the components in a mixture in a single
assay.

For instance, as explained herein, all of the members of a cellular library
can be tested for response to an environmental stimulus using a mixture of all of the
members of the cellular library in a single assay. This is accomplished, e.g., by labeling
each member of the cellular library, e.g., by cloning a nucleic acid tag into each cell type
in the library, mixing each cell type in the library in an appropriate solution, and
exposing part of the solution to the selected environmental stimulus. The distribution of
nucleic acids in the library before and after the environmental stimulus is compared by
hybridization of the nucleic acids to a VLSIPS™ array, allowing for detection of cells
which are specifically affected by the environmental stimulus.

Accordingly, the present invention provides, inter alia, tag nucleic acids,
sets of tag nucleic acids, methods of selecting tag nucleic acids, libraries of cells, viruses
or the like containing tag nucleic acids, arrays of oligonucleotide probes, arrays of
VLSIPS™ probes, methods of selecting arrays of oligonucleotide probes, methods of
detecting tag nucleic acids with VLSIPS™ arrays and other features which will become
clear upon further reading.

In one class of embodiments, the invention provides a method of selecting
a set of tag nucleic acids designed for minimal cross hybridization to a VLSIPS™ array.
The absence of cross hybridization facilitates analysis of hybridization patterns to
VLSIPS™ arrays, because it reduces ambiguities in the interpretation of hybridization
results which arise due to multiple nucleic acid species binding to a single species of
probe on the VLSIPS™ array. Thus, in the selection methods of the invention, potential
tags are excluded from the set of tags where they bind to the same nucleic acid as selected
tags under stringent conditions. The selection methods typically include the steps of
selecting a specific thermal binding stability for the tag acids against complementary
probes, and excluding tags which contain self-complementary regions. Often, the thermal
binding stability of the tags is selected by specifying parameters which influence binding
stability, such as the length and base composition of the tag nucleic acids (e.g., by
selecting tags with the same AT to GC ratio of nucleotides). In this regard, tags
which form more GC bonds upon binding a complementary probe require fewer overall
bases to have the same binding stability with a complementary probe as tags which have
fewer GC residues. Binding stability is also affected by base stacking interactions, the
formation of secondary structures and the choice of solvent in which a tag is bound to a
probe.
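
The selection steps described above (fixing length and base composition, then excluding tags with self-complementary regions) can be illustrated with a minimal Python sketch. The function names, the default length of 20, the default G+C count of 10 and the 4-base window used to detect self-complementarity are all assumptions made for illustration; the text does not fix these values for this step.

```python
# Illustrative sketch only; names and default thresholds are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_count(seq):
    """Count the G and C bases in a sequence."""
    return sum(1 for base in seq if base in "GC")


def has_self_complementary_region(seq, min_len=4):
    """True if any window of min_len bases has its reverse complement in seq.

    This also catches windows that are their own reverse complement.
    """
    for i in range(len(seq) - min_len + 1):
        window = seq[i:i + min_len]
        if reverse_complement(window) in seq:
            return True
    return False


def passes_basic_filters(seq, length=20, gc=10, min_len=4):
    """Keep a candidate tag only if it has the chosen length and G+C count
    and contains no self-complementary region of min_len bases."""
    return (
        len(seq) == length
        and gc_count(seq) == gc
        and not has_self_complementary_region(seq, min_len)
    )
```

Fixing both the length and the G+C count is one simple way to hold the AT to GC ratio constant across a candidate set; a real selection would also account for the other stability factors named above, such as base stacking and solvent.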

The size of the tags can vary substantially, but is typically from about 8-150
nucleotides, more typically between 10 and 100 nucleotides, often between about 15
and 30 nucleotides, generally between about 15 and 25 nucleotides and, in one preferred
embodiment, about 20 nucleotides in length. In a few applications, the tags are
substantially longer than the probes to which they hybridize. The use of longer tags
increases the number of tags from which non-cross hybridizing probes can be selected.

The tag nucleic acids are optionally selected to have constant and variable
regions, which facilitates elimination of secondary structure arising from
self-complementarity, and provides structural features for cloning and amplifying the tags.
For instance, PCR binding sites or restriction enzyme sites are optionally incorporated
into constant regions in the tags. In other embodiments, short constant regions are added
in coding theory methods to prevent misalignment of the tags. Constant regions are
optionally cleaved from the tag during processing steps, for instance by cleaving the tag
nucleic acids with class II restriction enzymes.

Often it is desirable to eliminate tags which contain runs of 4 nucleotides
selected from the group consisting of 4 X residues, 4 Y residues and 4 Z residues, where
X is selected from the group consisting of G and C, Y is selected from the group
consisting of G and A, and Z is selected from the group consisting of A and T. The
elimination of tags from a tag set which contain such runs of nucleotides reduces the
formation of secondary structure in the selected tags in the tag set. In some
embodiments, certain runs are permitted, while others are excluded. For instance, in one
embodiment, runs of 4 A/T or G/C nucleotides are prohibited.
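
The run rule above lends itself to a short check. The following Python sketch scans a candidate tag for runs of four bases drawn entirely from the X (G/C), Y (G/A) or Z (A/T) classes. The function names and the `classes` parameter, which lets some run classes be permitted, are hypothetical and introduced only for illustration.

```python
import re

# Run classes as defined in the text: X in {G, C}, Y in {G, A}, Z in {A, T}.
RUN_CLASSES = {"X": "GC", "Y": "GA", "Z": "AT"}


def prohibited_runs(seq, run_length=4, classes=("X", "Y", "Z")):
    """List (class name, start index, run) for every window of run_length
    bases drawn entirely from one of the selected classes."""
    found = []
    for name in classes:
        # A lookahead with a capture group reports overlapping windows too.
        pattern = re.compile("(?=([%s]{%d}))" % (RUN_CLASSES[name], run_length))
        for match in pattern.finditer(seq):
            found.append((name, match.start(), match.group(1)))
    return found


def has_prohibited_run(seq, run_length=4, classes=("X", "Y", "Z")):
    """True if the sequence contains at least one prohibited run."""
    return bool(prohibited_runs(seq, run_length, classes))
```

Passing `classes=("X", "Z")` corresponds to the embodiment in which only runs of 4 G/C or A/T nucleotides are prohibited.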

In many embodiments, tags which differ by fewer than about 80% of the 
total number of nucleotides which comprise the tags are excluded. For instance, all
selected tags in a selected tag set preferably differ by at least about 4-5 nucleotides. It is 
also desirable to exclude tags which share substantial regions of sequence identity 
because the regions of identity can cross-hybridize to nucleic acids which have 
subsequence complementary to the region of identity. For instance, where 20-mer tags 
are identical over regions of 9 or more nucleotides, they are typically excluded. 

The tags in the tag sets of the invention typically differ by at least two
nucleotides, and preferably by 3-5 nucleotides for a typical 20-mer. A list of tags which
differ by at least two nucleotides can be generated by pairwise comparison of each tag, or
by other methods. For instance, the tag sequences can be aligned for maximal
correspondence and tags with a single mismatch discarded. In one class of embodiments,
the number of A+G nucleotides in each of the variable regions of each of the tags is
selected to be even (or, alternatively, odd), providing a "parity base" or "error correcting
base" which provides that there are at least two hybridization mismatches between
every tag in the tag set and any individual complementary nucleic acid probe (other than
the probe which is a perfect complement to the tag). Other methods of ensuring that at
least two mismatches exist between every tag in a tag set and any individual hybridization
probe are also appropriate.
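
A minimal Python sketch of the pairwise comparison and the A+G parity check follows. It is a simplification under stated assumptions: tags are compared position by position without any shifting (the text speaks of alignment for maximal correspondence), the selection is a simple greedy pass in input order, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))


def select_distinct_tags(candidates, min_diff=2):
    """Greedily keep each candidate that differs from every tag kept so far
    at min_diff or more positions (unshifted comparison only)."""
    selected = []
    for tag in candidates:
        if all(hamming(tag, kept) >= min_diff for kept in selected):
            selected.append(tag)
    return selected


def has_even_purine_count(tag):
    """True if the number of A and G bases in the tag is even."""
    return sum(1 for base in tag if base in "AG") % 2 == 0
```

In this sketch the parity check serves only as a pre-filter on candidates; the explicit pairwise comparison is what enforces the minimum number of differences. A greedy pass does not necessarily yield the largest possible set.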

In general, the selection of the tag nucleic acids facilitates selection of the
probe nucleic acids, e.g., on VLSIPS™ arrays used to monitor the tag nucleic acids by
hybridization. Specifically, the probes on the array are selected for their ability to
hybridize to variable sequences in the set of tag nucleic acids (the "variable" region of a
tag which does not include a constant region is the entire tag). Thus, all of the rules for
selection of tag nucleic acids can be applied to the selection of probe nucleic acids, for
example by performing the tag selection steps and then determining the complementary
set of probe nucleic acids.
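
Determining the complementary probe set from a selected tag set amounts to taking the reverse complement of each tag's variable region. The Python sketch below assumes tags are written 5' to 3' as strings over A, C, G and T; the function names are hypothetical.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def probe_for_tag(tag_variable_region):
    """Return the probe sequence (5' to 3') complementary to a tag."""
    return tag_variable_region.translate(COMPLEMENT)[::-1]


def probes_for_tags(tags):
    """Map each selected tag to its complementary probe."""
    return {tag: probe_for_tag(tag) for tag in tags}
```

Because reverse complementation is its own inverse, applying `probe_for_tag` to a probe returns the original tag.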

In another class of embodiments, the invention provides compositions
comprising sets of tag nucleic acids, which include a plurality of tag nucleic acids. In
preferred embodiments, the set of tag nucleic acids comprises from 100-100,000 tags.
Typically, a tag set will include between about 500 and 15,000 tags. Usually, the number
of tags in a tag set is between about 5,000 and about 14,000 tags. In one preferred
embodiment, a set of tags of the invention comprises about 8,000-9,000 tags. The tag
sequences typically comprise a variable region, where the variable region for each tag
nucleic acid in the set of tag nucleic acids has the same G+C to A+T ratio,
approximately the same Tm, the same length and does not cross-hybridize to a single
complementary probe nucleic acid. Most typically, the tag nucleic acids in the set of tag
nucleic acids cannot be aligned with less than two differences between any two of the tag
nucleic acids in the set of tag nucleic acids, and often at least 5 differences exist between
any pair of tags in a tag set. In one embodiment, the tags also comprise a constant region
such as a PCR primer binding site for amplification of the tag.

In one class of embodiments, the invention provides a method of labeling a
composition, comprising associating a tag nucleic acid with the composition, wherein the
tag nucleic acid is selected from a group of tag nucleic acids which do not cross-hybridize
and which have a substantially similar Tm. Typically, the tag labels are detected with a
VLSIPS™ array which comprises probes complementary to the tags used to label the
composition.
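
Whether a group of tags has a substantially similar Tm can be checked roughly in code. The text does not specify how Tm is to be estimated, so the Python sketch below swaps in the Wallace rule, a common rule of thumb for short oligonucleotides (2°C per A or T plus 4°C per G or C). Both that choice and the function names are assumptions for illustration.

```python
def wallace_tm(seq):
    """Rough Tm estimate in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C).

    This is an approximation for short oligonucleotides only; it ignores
    salt concentration, base stacking and secondary structure.
    """
    at = sum(1 for base in seq if base in "AT")
    gc = sum(1 for base in seq if base in "GC")
    return 2 * at + 4 * gc


def tm_spread(tags):
    """Difference between the highest and lowest estimated Tm in a tag set."""
    estimates = [wallace_tm(tag) for tag in tags]
    return max(estimates) - min(estimates)
```

Under this estimate, tags of the same length and the same G+C count always have a spread of zero, which is consistent with selecting tags by length and base composition.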

As described herein, preferred compositions include constituents of
cellular, viral or molecular libraries such as recombinant cells, recombinant viruses or
polymers. However, one of skill will readily appreciate that other compositions can also
be labeled using the nucleic acid tags of the invention, and the tags detected using
VLSIPS™ arrays. For instance, high denomination currency can be labeled with a set of
nucleic acid tags, and counterfeits detected by monitoring hybridization of a wash of the
currency (or, e.g., a PCR amplification of attached nucleic acids which encode tag
sequences) with an appropriate VLSIPS™ array.

In another class of embodiments, the invention provides methods of
pre-selecting experimental probes in an oligonucleotide probe array, wherein the probes
have substantially uniform hybridization properties and do not cross hybridize to a target
tag nucleic acid. In the methods, a ratio of G+C to A+T nucleotides shared by the
experimental probes in the array is selected and all possible 4 nucleotide subsequences for
the probes of the array are determined. All potential probes from the array which contain
prohibited 4 nucleotide subsequences are excluded from the experimental probes of the
array. 4 nucleotide subsequences are prohibited when the nucleotide subsequences are
selected from the group consisting of self-complementary probes, A4 probes, T4 probes,
and [G,C]4 probes. Also, where the target tag nucleic acid comprises a constant region,
all probes complementary to the constant region subsequence of the target tag nucleic
acid are prohibited, and not present in the tag set. Typically, a length for the probes in
the array is selected, although non-hybridizing portions of the probe (i.e., nucleotides
which do not hybridize to a target nucleic acid) optionally vary between different classes
of probes. "Experimental probes" hybridize to a target tag nucleic acid, while "control"
probes either do not hybridize to a target tag nucleic acid, or bind to a nucleic acid which
has hybridization properties which differ from those of the target tag nucleic acids in a
tag nucleic acid set. For instance, control probes are optionally used in VLSIPS™ arrays
to check hybridization stringency against a known nucleic acid.
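
The pre-selection described above can be sketched by enumerating all 256 possible 4 nucleotide subsequences and marking the prohibited ones. In this Python sketch, "self-complementary" is read as a 4-mer that equals its own reverse complement, which is an assumed interpretation; the function names and the optional `gc_target` parameter are likewise hypothetical. The constant-region rule is not modeled.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def prohibited_4mers():
    """Enumerate all 4-mers and collect those that are self-complementary,
    AAAA, TTTT, or composed only of G and C."""
    prohibited = set()
    for bases in product("ACGT", repeat=4):
        kmer = "".join(bases)
        self_complementary = kmer == reverse_complement(kmer)
        a4_or_t4 = kmer in ("AAAA", "TTTT")
        gc_only = set(kmer) <= {"G", "C"}
        if self_complementary or a4_or_t4 or gc_only:
            prohibited.add(kmer)
    return prohibited


def is_allowed_probe(probe, prohibited, gc_target=None):
    """Reject probes containing a prohibited 4-mer and, if gc_target is
    given, probes whose G+C count differs from it."""
    if gc_target is not None:
        if sum(1 for base in probe if base in "GC") != gc_target:
            return False
    return not any(probe[i:i + 4] in prohibited for i in range(len(probe) - 3))
```

Under this reading, 30 of the 256 possible 4-mers are prohibited: 16 self-complementary, 16 G/C-only (4 of which are also self-complementary), plus AAAA and TTTT.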

In one class of methods of the invention, a plurality of test nucleic acids
are simultaneously detected in a sample. In the methods, an array of experimental probes
which do not cross hybridize to a target under stringent conditions is used to detect the
target nucleic acids. Typically, the ratio of G+C bases in each experimental probe is
substantially identical. The probes of the array are arranged into probe sets in which
each probe set comprises a homogeneous population of oligonucleotide probes. For
example, many individual probes with the same nucleotide sequence are arranged in
proximity to one another on the surface of an array to form a particular geometric shape.
Probe sets are arranged in proximity to each other to form an array of probes. For
instance, where the probe array is a VLSIPS™ array, the probe sets are optionally
arranged into squares on the surface of a substrate, forming a checkerboard pattern of
probe sets on the substrate.
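
As a purely illustrative sketch of such a checkerboard arrangement, the Python function below assigns each probe set a (row, column) square on the smallest square grid that holds all sets. The row-major ordering and the function name are assumptions; the text does not prescribe how probe sets are assigned to squares.

```python
import math


def checkerboard_layout(probe_sequences):
    """Assign each probe set a (row, column) square on a square grid.

    Probe sets are placed in row-major order on the smallest square grid
    that can hold them all. Each square stands for one homogeneous probe set.
    """
    side = math.ceil(math.sqrt(len(probe_sequences)))
    return {seq: divmod(i, side) for i, seq in enumerate(probe_sequences)}
```

For example, five probe sets are placed on a 3 by 3 grid, filling the first row and the first two squares of the second row.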

The probes of the array specifically hybridize to at least one test nucleic 
acid in the sample under stringent hybridization conditions. The method further 
comprises detecting hybridization of the test nucleic acids to the array of oligonucleotide
probes. Typically, the test nucleic acids comprise tag sequences, which bind to the 
experimental probes of the array. 

In one class of embodiments, the invention provides an array of
oligonucleotide probes comprising a plurality of experimental oligonucleotide probe sets
attached to a solid substrate, wherein each experimental oligonucleotide probe set in the
array hybridizes to a different target nucleic acid under stringent hybridization conditions.
Each experimental oligonucleotide probe in the probe sets of the array comprises a
constant region and a variable region. The variable region does not cross hybridize with
the constant region under stringent hybridization conditions, and the nucleic acid probes
do not cross-hybridize to target nucleic acids. Typically, the probes from each probe set
differ from the probes of every other probe set in the array by the arrangement of at least
two nucleotides in the probes of the probe set. Generally, the ratio of G+C bases in
each probe for each experimental probe set is substantially identical (meaning that the
G+C ratio does not vary by more than 5%), thereby assuring that they hybridize to a
target with similar avidity under similar hybridization conditions. The arrays optionally
comprise control probes, e.g., to assess hybridization conditions by monitoring binding of
a known quantitated nucleic acid to the control probe.
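
The 5% criterion can be expressed as a simple check. The Python sketch below reads "does not vary by more than 5%" as a spread of at most 0.05 between the highest and lowest G+C fraction across the probes; that reading, the function names and the `tolerance` parameter are assumptions. Exact fractions are used to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction


def gc_fraction(probe):
    """Exact fraction of G and C bases in a probe."""
    return Fraction(sum(1 for base in probe if base in "GC"), len(probe))


def gc_is_uniform(probes, tolerance=Fraction(5, 100)):
    """True if the G+C fractions of all probes lie within the tolerance."""
    fractions = [gc_fraction(probe) for probe in probes]
    return max(fractions) - min(fractions) <= tolerance
```

Under this reading, 20-mer probes may differ by at most one G or C base from one another.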

In another class of embodiments, the invention provides a plurality of
recombinant cells or recombinant viruses comprising tag nucleic acids, which tag nucleic
acids comprise a constant region and a variable region. Typically, the variable region for
each tag nucleic acid in the set of tag nucleic acids has approximately the same Tm (e.g.,
the same G+C to A+T ratio and the same length) and does not cross-hybridize to a
probe nucleic acid. Different tag nucleic acids in the set of tag nucleic acids found in the
different recombinant cells cannot be aligned with less than two differences between the
tag nucleic acids. Generally, the recombinant cells are selected from a library of
genetically distinct recombinant cells (eukaryotic, prokaryotic or archaebacterial) or
viruses. For example, in one class of preferred embodiments, the cells are yeast cells.
In another class of preferred embodiments, the cells are of mammalian origin.

The present invention provides arrays of oligonucleotides attached to solid
substrates. Typically, the oligonucleotide probes in the array are arranged into probe sets
at defined locations in the array to enhance signal processing of hybridization reactions
between the oligonucleotide probes and test nucleic acids in a sample. The
oligonucleotide arrays can have virtually any number of different oligonucleotide sets,
determined largely by the number or variety of test nucleic acids or nucleic acid tags to
be screened against the array in a given application. In one group of embodiments, the
array has from 10 up to 100 oligonucleotide sets. In other groups of embodiments, the
arrays have between 100 and 10,000 sets. In certain embodiments, the arrays have
between 10,000 and 100,000 sets, and in yet other embodiments the arrays have between
100,000 and 1,000,000 sets. Most preferred embodiments will have between 7,500 and
12,500 sets. For example, in one preferred embodiment, the arrays will comprise about
8,000 sets of oligonucleotide probes. In preferred embodiments, the array will have a
density of more than 100 sets of oligonucleotides at known locations per cm², or more
preferably, more than 1000 sets per cm². In some embodiments, the arrays have a
density of more than 10,000 sets per cm².

The present invention also provides kits embodying the inventive concepts
outlined above. For example, kits of the invention comprise any of the arrays, cells,
libraries or tag sets described herein. Also, because the methods of using the arrays and
tags optionally include PCR, LCR and other in vitro amplification techniques for
amplifying tag nucleic acids, the kits of the invention optionally include reagents for
practicing in vitro amplification methods such as Taq polymerase, nucleotides, computer
software with tag selection programs and the like. The kits also optionally comprise
nucleic acid labeling reagents, instructions, containers and other items that will be
apparent to one of skill upon further reading.

BRIEF DESCRIPTION OF THE DRAWING 
Figure 1 is a scanned image of a 1.28 cm by 1.28 cm high-density array
hybridized with a fluorescently labeled control oligonucleotide. The array contains
complementary sequences to 4,500 20mer tags selected as described in Table 1. Control
oligonucleotides are synthesized in the corners and in a cross-hair pattern across the array
to verify the uniformity of the synthesis and the hybridization conditions. "DNA TAGS"
was spelled out with control oligonucleotides as well. The dark areas indicate the
location of the 4,500 20 base molecular tags. Note that there is no cross-hybridization of
the control oligonucleotide and the molecular tag sequences.

Figure 2 shows a PCR-targeting strategy used to generate tagged deletion
strains. (a) The ORF is identified from the sequence information in the database.
Regions immediately flanking the ORF are used to generate the deletion strain. (b) The
selectable marker (kanR) is amplified using a pair of long primers to generate an
ORF-specific deletion construct. The up-stream 86mer primer consists of (5' to 3'): 30 bases
of yeast homology, an 18 base common tag priming site, a 20 base molecular tag, and a
22 base sequence that is homologous to one side of the marker. The down-stream
oligonucleotide consists of 50 bases of yeast homology to the other side of the targeted
ORF and 16 bases that are homologous to the other side of the marker. The dashed lines
representing the long oligonucleotides illustrate that the primers are unpurified and are
missing sequence on the 5' end. (c) A second round of PCR with 20mers homologous to
the ends of the initial PCR product was used to "flush" the ragged ends generated by
unpurified oligonucleotide in the first round. (d) The resulting marker flanked by yeast
ORF homology on either side is transformed directly into a haploid yeast strain and
homologous recombination results in the replacement of the targeted ORF with the
marker, 20mer tag, and tag priming site.

Figure 3 shows oligonucleotides used to generate the ADE1 tagged deletion
strain. Similar sets of oligonucleotides were synthesized for the other ten auxotrophic
ORFs.

Figure 4 shows transformation results and tag information for eleven
auxotrophic ORFs. Eight colonies from each transformation were analyzed by replica
plating and PCR; the resulting targeting efficiency is shown for each of the ORFs.

Figure 5 shows the tag amplification strategy described in Example 1.
(a) A deletion pool was generated by combining equal numbers of the eleven tagged
deletion strains described in Figure 3. Genomic DNA isolated from a representative
aliquot of the pool was used as template for a tag amplification reaction. (b) Tags were
amplified using a single pair of primers that are homologous to the common priming sites
which flank each tag. One of the common primers is labeled with 5' fluorescein and [...].
(c) PCR generates a population of single-stranded fluorescently labeled 60mer tag amplicons
that are directly hybridized to the high-density 20mer array, which is then washed and
scanned. (d) An actual scanned image of the array shows the (predicted) hybridization
pattern for the tags with virtually no cross-hybridization on the rest of the chip. A
closeup view of the left hand corner shows the location of the tags for each of the
different deletion strains.

Figure 6 shows the analysis of a deletion pool containing 11 tagged
auxotrophic deletion strains. A deletion pool was generated by combining equal numbers
of cells from each of the 11 deletion strains described in Figure 3. Representative
aliquots were grown in (A) complete media (SDC), (B) media missing adenine
(SDC-ADE), or (C) media missing tryptophan (SDC-TRP). Cells were harvested at the
indicated time points and genomic DNA was isolated. Tags were amplified from the
genomic DNA and labeled amplicons were directly hybridized to the high-density array
for 30 minutes, washed, and scanned. A blowup of the upper left hand corner for each
of the scans is shown.

DEFINITIONS 

Unless defined otherwise, technical and scientific terms used herein have
the same meaning as commonly understood by one of ordinary skill in the art to which
this invention belongs. Singleton et al. (1994) Dictionary of Microbiology and Molecular
Biology, second edition, John Wiley and Sons (New York) and March (Advanced
Organic Chemistry: Reactions, Mechanisms and Structure, 4th ed., J. Wiley and Sons (New
York), 1992) provide one of skill with a general guide to many of the terms used in this
invention.



Although one of skill will recognize many methods and materials similar or
equivalent to those described herein which can be used in the practice of the present
invention, the preferred methods and materials are described. For purposes of the present
invention, the following terms are defined below.

"Eukaryotic" cells are cells which contain at least one nucleus in which the
cell's genomic DNA is organized, or which are the differentiated offspring of cells which
contained at least one nucleus. Eukaryotes are distinguished from prokaryotes, which are
cellular organisms which carry their genomic DNA in the cell's cytoplasm.

A "nucleoside" is a pentose glycoside in which the aglycone is a
heterocyclic base; upon the addition of a phosphate group the compound becomes a
nucleotide. The major biological nucleosides are β-glycoside derivatives of D-ribose or
D-2-deoxyribose. Nucleotides are phosphate esters of nucleosides which are acidic due to
the hydroxy groups on the phosphate. The polymerized nucleotides deoxyribonucleic acid
(DNA) and ribonucleic acid (RNA) store the genetic information which controls all
aspects of an organism's interaction with its environment. The nucleosides of DNA and
RNA are connected together via phosphate units attached to the 3' position of one pentose
and the 5' position of the next pentose.

A "nucleic acid" is a deoxyribonucleotide or ribonucleotide polymer in
either single- or double-stranded form, and unless otherwise limited, encompasses known
analogs of natural nucleotides that function in a manner similar to naturally occurring
nucleotides.

An "oligonucleotide" is a nucleic acid polymer composed of two or more 
nucleotides or nucleotide analogues. An oligonucleotide can be derived from natural 
sources but is often synthesized chemically. It is of any size.

An "oligonucleotide array" is a spatially defined pattern of oligonucleotide
probes on a solid support. A "preselected array of oligonucleotides" is an array of
spatially defined oligonucleotides on a solid support which is designed before being
synthesized (i.e., the arrangement is deliberate, and not random).

A "nucleotide reagent" used in oligonucleotide synthesis carries a protected
phosphate on the 3' hydroxyl of the ribose. Thus, nucleotide reagents include
nucleoside phosphates, nucleoside-3'-phosphates, nucleoside phosphoramidites,
phosphoramidites, nucleoside phosphonates, phosphonates and the like. It is generally
understood that nucleotide reagents carry a protected phosphate group in order to form a
phosphodiester linkage.

A "protecting group" is any of the groups which are designed to block one
reactive site in a molecule while a chemical reaction is carried out at
another reactive site. More particularly, the protecting groups used herein can be any
of those groups described in Greene et al., Protective Groups in Organic Synthesis, 2nd
Ed., John Wiley & Sons, New York, NY, 1991, which is incorporated herein by
reference. The proper selection of protecting groups for a particular synthesis is
governed by the overall methods employed in the synthesis. For example, in
"light-directed" synthesis, discussed herein, the protecting groups are typically photolabile
protecting groups such as NVOC, MeNPoc, and those disclosed in co-pending
Application PCT/US93/10162 (filed October 22, 1993), incorporated herein by reference.
In other methods, protecting groups are removed by chemical methods and include groups
such as FMOC, DMT and others known to those of skill in the art.

A "solid substrate" has a fixed organizational support matrix, such as
silica, polymeric materials, or glass. In some embodiments, at least one surface of the
substrate is partially planar. In other embodiments it is desirable to physically separate
regions of the substrate to delineate synthetic regions, for example with trenches,
grooves, wells or the like. Examples of solid substrates include slides, beads and
polymeric chips. A solid support is "functionalized" to permit the coupling of monomers
used in polymer synthesis. For example, a solid support is optionally coupled to a
nucleoside monomer through a covalent linkage to the 3'-carbon on a furanose. Solid
support materials typically are unreactive during polymer synthesis, providing a
substratum to anchor the growing polymer. Solid support materials include, but are not
limited to, glass, silica, controlled pore glass (CPG), polystyrene, polystyrene/latex, and
carboxyl modified Teflon. The solid substrates are biological, nonbiological, organic,
inorganic, or a combination of any of these, existing as particles, strands, precipitates,
gels, sheets, tubing, spheres, containers, capillaries, pads, slices, films, plates, slides,
etc. depending upon the particular application. In light-directed synthetic techniques, the
solid substrate is often planar but optionally takes on alternative surface configurations.
For example, the solid substrate optionally contains raised or depressed regions on which
synthesis takes place. In some embodiments, the solid substrate is chosen to provide
appropriate light-absorbing characteristics. For example, the substrate may be a
polymerized Langmuir Blodgett film, functionalized glass, Si, Ge, GaAs, GaP, SiO2,
SiN4, modified silicon, or any one of a variety of gels or polymers such as
(poly)tetrafluoroethylene, (poly)vinylidenedifluoride, polystyrene, polycarbonate, or
combinations thereof. Other suitable solid substrate materials will be readily apparent to
those of skill in the art. Preferably, the surface of the solid substrate will contain reactive
groups, such as carboxyl, amino, hydroxyl, thiol, or the like. More preferably, the
surface is optically transparent and has surface Si-OH functionalities, such as are found
on silica surfaces. A substrate is a material having a rigid or semi-rigid surface. In
spotting or flowing VLSIPS™ techniques, at least one surface of the solid substrate is
optionally planar, although in many embodiments it is desirable to physically separate
synthesis regions for different polymers with, for example, wells, raised regions, etched
trenches, or the like. In some embodiments, the substrate itself contains wells, trenches,
flow through regions, etc. which form all or part of the regions upon which polymer
synthesis occurs.

The term "recombinant" when used with reference to a cell or virus 
indicates that tiie cell or virus encodes a DNA or RNA whose origin is exogenous to the 
cdl or virus. Thus, for example, recombinant cells optionally express nucleic acids {e.g., 
RNA) not found within tiie native (non-recombinant) form of flie cell. 

"Stringent" hybridization conditions are sequence dependent and will be 
different with different environmental parameters (salt concentrations, presence of 
organics, etc.). Generally, stringent conditions are selected to be about 5°C to 20°C 
lower than the thermal melting point (Tm) for the specific nucleic acid sequence at a 
defined ionic strength and pH. Preferably, stringent conditions are about 5°C to 10°C 
lower than the thermal melting point for a specific nucleic acid bound to a complementary 
nucleic acid. The Tm is the temperature (under defined ionic strength and pH) at which 
50% of a nucleic acid (e.g., tag nucleic acid) hybridizes to a perfectly matched probe. 
"Thermal binding stability" is a measure of the temperature-dependent stability of a 
nucleic acid duplex in solution. The thermal binding stability for a duplex is dependent 
on the solvent, base composition of the duplex, number and type of base pairs, position of 
base pairs in the duplex, length of the duplex and the like. 

"Stringent" wash conditions are ordinarily determined empirically for 
hybridization of each set of tags to a corresponding probe array. The arrays are first 
hybridized (typically under stringent hybridization conditions) and then washed with 
buffers containing successively lower concentrations of salts, and/or higher concentrations 
of detergents, and/or at increasing temperatures until the signal to noise ratio for specific 
to non-specific hybridization is high enough to facilitate detection of specific 
hybridization. 

Stringent temperature conditions will usually include temperatures in excess 
of about 30°C, more usually in excess of about 37°C, and occasionally in excess of 
about 45°C. Stringent salt conditions will ordinarily be less than about 1000 mM, usually 
less than about 500 mM, more usually less than about 400 mM, typically less than about 
300 mM, preferably less than about 200 mM, and more preferably less than about 150 
mM. However, the combination of parameters is more important than the measure of 
any single parameter. See, e.g., Wetmur and Davidson (1968) J. Mol. Biol. 31:349-370 
and Wetmur (1991) Critical Reviews in Biochemistry and Molecular Biology 26(3/4): 
227-259. 

The term "identical" in the context of two nucleic acid sequences refers to 
the residues in the two sequences which are the same when aligned for maximum 
correspondence. Optimal alignment of sequences for comparison can be conducted, e.g., 
by the local homology algorithm of Smith and Waterman Adv. Appl. Math. 2: 482 
(1981), by the homology alignment algorithm of Needleman and Wunsch J. Mol. Biol. 
48:443 (1970), by the search for similarity method of Pearson and Lipman Proc. Natl. 
Acad. Sci. (U.S.A.) 85: 2444 (1988), by computerized implementations of these 
algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software 
Package, Genetics Computer Group, 575 Science Dr., Madison, WI), or by inspection. 

A nucleic acid "tag" is a selected nucleic acid with a specified nucleic acid 
sequence. A nucleic acid "probe" hybridizes to a nucleic acid "tag." In one typical 
configuration, nucleic acid tags are incorporated as labels into biological libraries, and the 
tag nucleic acids are detected using a VLSIPS™ array of probes. Thus, the "tag" nucleic 
acid functions in a manner analogous to a bar code label, and the VLSIPS™ array of 
probes functions in a manner analogous to a bar code label reader. A "list of tag 
nucleic acids" is a pool of tag nucleic acids, or a representation (i.e., an electronic or 
paper copy) of the sequences in the pool of tag nucleic acids. The pool of tags can be, 
for instance, all possible tags of a specified length (i.e., all 20-mers), or a subset thereof. 

A set of nucleic acid tags binds to a probe with "minimal cross 
hybridization" when a single species (or "type") of tag in the tag set accounts for the 
majority of all tags which bind to an array comprising a probe species under stringent 
conditions. Typically, about 80% or more of the tags bound to the probe species are of a 
single species under stringent conditions. Usually about 90% or more of the tags bound 
to the probe species are of a single species under stringent conditions. Preferably 95% or 
more of the tags bound to the probe species are of a single species under stringent 
conditions. 
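
By way of illustration only, the definition above can be expressed as a short calculation. 
The sketch below assumes that the tags bound to one probe species have already been 
identified and are supplied as a list of species labels; the function names and the grade 
labels are hypothetical and are not part of the disclosure. 

```python
from collections import Counter


def dominant_fraction(bound_tags):
    """Return (species, fraction) for the most common tag species bound to one probe."""
    counts = Counter(bound_tags)
    species, n = counts.most_common(1)[0]
    return species, n / len(bound_tags)


def cross_hybridization_grade(bound_tags):
    """Grade one probe against the 80% / 90% / 95% levels given in the definition."""
    _, fraction = dominant_fraction(bound_tags)
    if fraction >= 0.95:
        return "preferred"
    if fraction >= 0.90:
        return "usual"
    if fraction >= 0.80:
        return "typical"
    return "fails"


if __name__ == "__main__":
    bound = ["tag_1"] * 96 + ["tag_2"] * 4
    print(dominant_fraction(bound), cross_hybridization_grade(bound))
```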

DETAILED DESCRIPTION OF THE INVENTION 
The invention provides methods of selecting and detecting sets of tag 
nucleic acids. In addition, the invention provides arrays of probe nucleic acids for 
detecting tag nucleic acids, sets of tag nucleic acids and cells which comprise tag nucleic 
acids. The tag nucleic acids, probe nucleic acid arrays and cells transformed with tag 
nucleic acids have a variety of uses. Most commonly, the tag nucleic acids of the 
invention are used to tag cells with known genotypic markers (mutants, polymorphisms, 
etc.), and to track the effect of environmental changes on the viability of tagged cells. 

For instance, with the completion of the sequencing of S. cerevisiae, 
thousands of open reading frames (ORFs) have been identified. One strategy for 
identifying the function of the identified ORFs is to create deletion mutants for each ORF 
followed by analysis of the resulting deletion mutants under a wide variety of selective 
conditions. Typically, the goal of such an analysis is to identify a phenotype that reveals 
the function of the missing ORF. If the analysis were to be carried out for each deletion 
mutant in a separate experiment, the required time and cost for monitoring the effect of 
altering an environmental parameter on each deletion mutant would be prohibitive. For 
instance, to identify ORFs which are required for synthesis of an amino acid, all of the 
thousands of ORF deletion mutants would be individually tested for the ability of the 
mutant to grow in media lacking the amino acid. Even if the analysis were carried out in 
a parallel fashion using, e.g., 96-well plates, the effort required to plate, organize, label 
and track each clone would be prohibitive. The present invention provides a much more 
cost-effective approach to screening cells. 

In the methods of the invention, all of the thousands of deletion mutants 
described above can be tested in parallel in a single experiment. The deletion mutants are 
each tagged with a tag nucleic acid, and the deletion mutants are then pooled. The 
pooled, tagged deletion mutants are then simultaneously tested for their response to an 
environmental stimulus (e.g., growth in medium lacking an amino acid). The deletion 
cell-specific tags are then read using a probe array such as a VLSIPS™ array. Thus, by 
analogy, the deletion cell-specific nucleic acid tags act as bar code labels for the cells and 
the VLSIPS™ array acts as a bar code reader. 

While the example above specifically discusses labeling yeast cells, one of 
skill will appreciate that essentially any cell type can be labeled with the nucleic acid tags 
of the invention, including prokaryotes, eukaryotes, and archaebacteria. Also, essentially 
any virus can be similarly labeled, as can cellular organelles with nucleic acids 
(mitochondria, chloroplasts, etc.). In fact, labeling by tag nucleic acids and detection by 
probe arrays is not in any way limited to biological materials. One of skill will recognize 
that many other compositions can also be labeled by nucleic acid tags and detected with 
probe arrays. Essentially anything which benefits from the attachment of a label can be 
labeled and detected by the tags, arrays, and methods of the invention. For instance, 
high denominational currency, original works of art, valuable stamps, significant legal 
documents such as wills, deeds of property, and contracts can all be labeled with nucleic 
acid tags and the tags read using the probe arrays of the invention. Methods of attaching 
and cleaving nucleic acids to and from many substrates are well known in the art, 
including glass, polymers, paper, ceramics and the like, and these techniques are 
applicable to the nucleic acid tags of the invention. 

One of skill will also appreciate that while many of the examples herein 
describe the use of a single tag nucleic acid to label a cell, multiple tags can also be used 
to label any cell, e.g., by cloning multiple nucleic acid tags into the cell. Similarly, 
multiple nucleic acid tags can be used to label a substance such as those described above. 
Indeed, multiple labels are typically preferred where the object of the nucleic acid tags is 
to detect forgery. For instance, the nucleic acids of the invention can be used to label a 
high denomination currency bill with hundreds, or even thousands, of distinct tags, such 
that visualization of the hybridization pattern of the tags on a VLSIPS™ array provides 
verification that the currency bill is genuine. 

In certain embodiments, multiple probes bind to unique regions on a single 
tag. In these embodiments, the probes are typically relatively large, about 50 
nucleotides or longer. Probes are selected such that each probe binds to a single region 
on a single tag. The use of tags which bind multiple probes increases the informational 
content of hybridization reactions by making it possible to monitor multiple 
hybridization events simultaneously. 

One of skill will also appreciate that it is not necessary to hybridize a tag 
directly to a probe array to achieve essentially the same effect. For instance, tag nucleic 
acids are optionally (and preferably) amplified, e.g., using PCR or LCR or other known 
amplification techniques, and the amplification products ("amplicons") hybridized to the 
array. For instance, a nucleic acid tag optionally includes or is in proximity to PCR 
primer binding sites which, when amplified using standard PCR techniques, amplify the 
tag nucleic acid, or a subsequence thereof. Thus, cells or other tagged items can be 
detected even if the tag nucleic acids are present in very small quantities. One of skill 
will appreciate that a single molecule of a nucleic acid tag can easily be detected after 
amplification, e.g., by PCR. The complexity reduction from amplifying a selected 
mixture of tags (i.e., there are relatively few amplicon nucleic acid species as compared 
to a pool of genomic DNA) facilitates analysis of the mixture of tags. 

In one preferred embodiment, tags are selected such that each selected tag 
has a complementary selected tag. For example, if a tag is cloned into an organism, the 
tag can be amplified using LCR, PCR or other amplification methods. The amplified tag 
is often double-stranded. In preferred embodiments, tag sets which include 
complementary sets of tags have corresponding probes for each complementary tag. Both 
strands of a double-stranded tag amplification product are separately monitored by the 
probe array. Hybridization of each of the strands of the double-stranded tag provides an 
independent readout for the presence or absence of the tag nucleic acid in a sample. 

Selection of tag nucleic acids. 

This invention provides ways of selecting nucleic acid tag sets useful for 
labeling cells and other compositions as described above. The tag sets provided by the 
selection methods of the invention have uniform hybridization characteristics (i.e., similar 
thermal binding stability to complementary nucleic acids), making the tag sets suitable for 
detection by VLSIPS™ and other probe arrays, such as Southern or northern blots. 
Because the hybridization characteristics of the tags are uniform, all of the tags in the set 
are typically detectable using a single set of hybridization and wash conditions. As 
described in the Examples below, various selection methods of the invention were used to 
generate lists of about 10,000 suitable 20-mer nucleic acid tags from all possible 20-mer 
sequences (about 1,200,000,000,000). The synthesis of a single array with 10,000 
probes complementary to the 10,000 tag nucleic acids (i.e., for the detection of the tags) 
was carried out using standard VLSIPS™ techniques to make a VLSIPS™ array. 

Desirable nucleic acid tag sets have several properties. These include, 
inter alia, that the hybridization of the tags to their complementary probe (i.e., in the 
VLSIPS™ array) is strong and uniform; that individual tags hybridize only to their 
complementary probes, and do not significantly cross hybridize with probes 
complementary to other sequence tags; and that, if there are constant regions associated 
with the tags (e.g., cloning sites, or PCR primer binding sites), the constant regions do not 
hybridize to a corresponding probe set. If the selected tag set has the described 
properties, any mixture of tags can be hybridized to a corresponding array, and the 
absence or presence of the tag can be unambiguously determined and quantitated. 
Another advantage to such a tag set is that the amount of binding of any tag set can be 
quantitated, providing an indication of the relative ratio of any particular tag nucleic acid 
to any other tag nucleic acid in the tag set. 

The properties outlined above are obtained by following some or all of the 
selection steps outlined below for selection of tag sequence characteristics. 

(1) Determine all possible nucleic acid tags of a selected length, or with 
selected hybridization properties. Although the examples below provide ways of selecting 
tags from pools of tags of a single length for illustrative purposes, one of skill will 
appreciate that the tags can have different lengths, e.g., where the tags have the same (or 
closely similar) melting temperatures against perfectly complementary targets. One of 
skill will also appreciate that a subset of all possible tags can be used, depending on the 
application. For instance, where tags are used to detect an organism, 20-mers which 
either occur in the organism's genome, or do not occur in the organism's genome, can be 
used as a starting point for a pool of potential nucleic acid tags. For instance, the entire 
genome of S. cerevisiae is available. In certain embodiments, endogenous sequences 
(e.g., those which appear in ORFs) can be used as tags, obviating the need for cloning 
tags into the organism. Thus, all possible 20-mers from the genome are determined from 
the genomic sequence, and 20-mers with desired hybridization characteristics to probes 
are selected. Alternatively, where tags are cloned into the organism, it is occasionally 
preferable to eliminate all 20-mers which appear naturally in the genome from 
consideration as tag sequences, so that introduced tag sequences are not confused with 
endogenous sequences in hybridization assays. 
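
As a minimal sketch of step (1), assuming the genomic sequence is available as a plain 
string over the letters A, C, G and T, the endogenous k-mers can be enumerated and 
then either used as the starting pool or excluded from it. The function names are 
hypothetical, and checking both strands is an assumption of this sketch rather than a 
requirement stated above. 

```python
def kmers(sequence, k=20):
    """Return the set of all overlapping windows of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def exclude_genomic(candidates, genome, k=20):
    """Drop candidate tags that occur in the genome on either strand."""
    present = kmers(genome, k)
    present |= {reverse_complement(s) for s in present}
    return [c for c in candidates if c not in present]


if __name__ == "__main__":
    toy_genome = "ACGTTGCA"  # toy data, k = 4 for readability
    print(sorted(kmers(toy_genome, 4)))
    print(exclude_genomic(["ACGT", "AACG", "GGGG"], toy_genome, 4))
```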

The selection of the length of the nucleic acid tag is dependent on the 
desired hybridization and discrimination properties of the probe array for detecting the 
tag. In general, the longer the tag, the higher the stringency of the hybridization and 
washes of the hybridized nucleic acids on the array. However, longer tags are not as 
easily discriminated on the array, because a single mismatch on a long nucleic acid 
duplex has less of a destabilizing effect on hybridization than a single mismatch on a 
short nucleic acid duplex. It is expected that one of skill is thoroughly familiar with the 
theory and practice of nucleic acid hybridization to an array of nucleic acids. In addition 
to the patents and literature cited supra in regards to the synthesis of VLSIPS™ arrays, 
Gait, ed. Oligonucleotide Synthesis: A Practical Approach, IRL Press, Oxford (1984); 
W.H.A. Kuijpers Nucleic Acids Research 18(17), 5197 (1994); K.L. Dueholm J. Org. 
Chem. 59, 5767-5773 (1994); S. Agrawal (ed.) Methods in Molecular Biology, volume 
20; and Tijssen (1993) Laboratory Techniques in biochemistry and molecular 
biology-hybridization with nucleic acid probes, e.g., part I chapter 2 "overview of 
principles of hybridization and the strategy of nucleic acid probe assays", Elsevier, New 
York provide a basic guide to nucleic acid hybridization. 

Most typically, tags are between 8 and 100 nucleotides in length and 
preferably between about 10 and 30 nucleotides in length. Most preferably, the tags are 
between 15 and 25 nucleotides in length. For example, in one preferred embodiment 
the nucleic acid tags are about 20 nucleotides in length. 

(2) The tags are selected so that there is no complementarity between any 
of the probes in an array selected to hybridize to the tag set and any constant tag region 
(constant tag regions are optionally provided to provide primer binding sites, e.g., for 
PCR amplification of the remainder of the tag, or to limit the selected tags as described 
below). In other words, the complementary nucleic acid of the variable region of a 
nucleic acid tag cannot hybridize to any constant region of the nucleic acid tag. One of 
skill will appreciate that constant regions in tag sequences are optional, typically being 
used where a PCR or other primer binding site is used within the tag. 

(3) The tags are selected so that no tag hybridizes to a probe with only one 
sequence mismatch (all tags differ by at least two nucleotides). Optionally, tags can be 
selected which have at least 2 mismatches, 3 mismatches, 4 mismatches, 5 mismatches or 
more to a probe which is not perfectly complementary to the tag, depending on the 
application. Typically, all tag sequences are selected to hybridize only to a perfectly 
complementary probe, and the nearest mismatch hybridization possibility has at least two 
hybridization mismatches. Thus, the tag sequences typically differ by at least two 
nucleotides when aligned for maximal correspondence. Preferably, the tags differ by 
about 5 nucleotides when aligned for maximal correspondence (e.g., where the tags are 
20-mers). 
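
The minimum-difference requirement of step (3) can be checked, in a simplified form, 
by counting position-by-position differences between equal-length tags. The sketch 
below is illustrative only: it uses an ungapped (Hamming) comparison, which is a 
simplifying assumption, whereas the text above contemplates alignment for maximal 
correspondence. The function names are hypothetical. 

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length tags differ (no gaps)."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(tags):
    """Smallest ungapped difference between any two tags in the set."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))


def satisfies_min_mismatches(tags, minimum=2):
    """True if every pair of tags differs in at least `minimum` positions."""
    return min_pairwise_distance(tags) >= minimum


if __name__ == "__main__":
    print(satisfies_min_mismatches(["GATTACA", "GATTGCA", "CATTACG"]))
    print(satisfies_min_mismatches(["GATTACA", "CATTACG"]))
```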

The tags are often selected so that they do not have identical runs of 
nucleotides of a specified length. For instance, where the tags are 20-mers, the tags are 
preferably selected so that no two tags have runs of about 9 or more nucleotides in 
common. One of skill will appreciate that the length of prohibited identity varies 
depending on the selected length of the tag. It was empirically determined that cross- 
hybridization occurs in tag sets when 20-mer tags have more than about 8 contiguous 
nucleotides in common. 

(4) The tags are selected so that there is no secondary structure within the 
complementary probes used to detect the tags which are complementary to the tags. 
Typically, this is done by eliminating tags from a selected tag set which have 
subsequences of 4 or more nucleotides which are complementary. 
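
One way to sketch the elimination described in step (4) is to test whether the reverse 
complement of any 4-base window of a tag occurs anywhere in the same tag. This is an 
illustrative simplification with hypothetical function names; it flags overlapping and 
self-complementary windows as well, and does not model loop length. 

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_self_complementary_region(tag, length=4):
    """True if any window of `length` bases can pair with another part of the tag.

    A window pairs with the tag when its reverse complement also occurs in the
    tag; a window equal to its own reverse complement (e.g. ACGT) is flagged too.
    """
    for i in range(len(tag) - length + 1):
        if reverse_complement(tag[i:i + length]) in tag:
            return True
    return False


if __name__ == "__main__":
    for tag in ("GGGATTTTCCCA", "GAAGAAGAGAAG", "GAACGTGA"):
        print(tag, has_self_complementary_region(tag))
```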

(5) The tags are selected so that no secondary structure forms between a 
tag and any associated constant sequence. Self-complementary tags have poor 
hybridization properties in arrays, because the complementary portions of the probes (and 
corresponding tags) self hybridize (i.e., form hairpin structures). 

(6) The tags are selected so that probes complementary to the tags do not 
hybridize to each other, thereby preventing duplex formation of the tags in solution. 

(7) If there is more than one constant region in the tag, the constant regions 
of the tag are selected so that they do not self-hybridize or form hairpin structures. 

(8) Where the tags are of a single length, the tags are selected so that they 
have roughly the same, and preferably exactly the same, overall base composition (i.e., 
the same A+T to G+C ratio of nucleic acids). Where the tags are of differing lengths, 
the A+T to G+C ratio is determined by selecting a thermal melting temperature for the 
tags and selecting an A+T to G+C ratio and probe length for each tag which has the 
selected thermal melting temperature. 

One of skill will recognize that there are a variety of possible ways of 
performing the above selection steps. Most typically, selection steps are performed using 
simple computer programs to perform the selection in each of the steps outlined above; 
however, all of the steps are optionally performed manually. The following strategies are 
provided for exemplary purposes; one of skill will recognize that a variety of similar 
strategies can be used to achieve similar results. 
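
As one example of such a simple program, the uniform base composition of step (8) for 
tags of a single length can be sketched by grouping candidate tags by their G+C count 
and keeping one group. The function names are hypothetical, and keeping the largest 
group is merely one possible choice made for this illustration. 

```python
from collections import defaultdict


def gc_count(tag):
    """Number of G and C residues in a tag."""
    return sum(base in "GC" for base in tag)


def group_by_gc(tags):
    """Map each G+C count to the tags having that count, preserving input order."""
    groups = defaultdict(list)
    for tag in tags:
        groups[gc_count(tag)].append(tag)
    return dict(groups)


def largest_uniform_group(tags):
    """Return the largest subset of tags sharing one G+C count."""
    return max(group_by_gc(tags).values(), key=len)


if __name__ == "__main__":
    pool = ["GATTACAG", "GACTTCAA", "GGGGCCCC", "ATGCATGA"]
    print(group_by_gc(pool))
    print(largest_uniform_group(pool))
```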

In one embodiment, to prevent secondary structure within the tag and 
hybridization within or between pairs of complementary probes (goals 4, 5, and 6), tags 
were dynamically excluded when any of the following properties were met: 

(a) Tags containing complementary regions of 4 or more bases, including those which 
overlap in sequence, and 4-mers which are self complementary. To prevent tag variable 
sequences from hybridizing to the constant primer sequences, regions of 4 or more bases 
in a tag which are complementary to 4 base sequences fully or partially contained in the 
constant sequence which were separated by at least 3 bases (i.e., the minimal separation 
typically required for hairpin formation) were prohibited. 

(b) To assure uniformity of hybridization strength, runs of 4-mers comprised only 
of 4 As, 4 Ts or 4 G or C residues were prohibited. Excluding runs of T/A and G/A is 
also desirable. 
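
The run exclusion of (b) lends itself to a short pattern test. The following is a sketch 
only, assuming tags are upper-case strings over A, C, G and T; the names 
PROHIBITED_RUN and has_prohibited_run are hypothetical. It covers runs of four As, 
four Ts, or four residues drawn from G and C, and does not attempt the optional 
additional exclusions. 

```python
import re

# Four As, four Ts, or any four consecutive residues that are each G or C.
PROHIBITED_RUN = re.compile(r"A{4}|T{4}|[GC]{4}")


def has_prohibited_run(tag):
    """True if the tag contains one of the prohibited 4-base runs."""
    return PROHIBITED_RUN.search(tag) is not None


def remove_prohibited(tags):
    """Keep only the tags that contain no prohibited run."""
    return [tag for tag in tags if not has_prohibited_run(tag)]


if __name__ == "__main__":
    print(remove_prohibited(["GATAAAATCG", "GACGCCTA", "GATCGATCAT"]))
```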

Further selection is optionally performed to refine aspects of the selection 
steps outlined above. For instance, to select tags which are less likely to cross-hybridize, 
more fixed or restricted bases can be added for alignment purposes, the tags can be 
lengthened, and additional coding requirements can be imposed. In one embodiment, 
tag selection by the method above is performed, and a subset of tags with reduced 
cross-hybridization is selected. For example, a first tag from the set generated above is 
selected, and a second tag is selected from the tag set. If the second tag does not 
cross-hybridize with the first tag, the second tag remains in the tag set. If it does 
cross-hybridize, the tag is discarded. Thus, each tag from the group selected by the methods 
outlined above is compared to every other tag in the group, and selected or discarded 
based upon comparison of hybridization properties. This process of comparison of one 
tag to every other possible tag in a pool of tags is referred to as pairwise comparison. 
Similar to the steps outlined above, cross-hybridization can be determined in a dynamic 
programming method as used for the sequence alignment above. 
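
The pairwise comparison can be sketched as a greedy filter in which each candidate is 
kept only if it does not cross-hybridize with any tag already kept. In the sketch below 
the cross-hybridization test is passed in as a function; the stand-in predicate 
shares_run (a shared run of n bases) and all function names are hypothetical and are 
used only so that the example is self-contained. 

```python
def greedy_pairwise_filter(candidates, cross_hybridizes):
    """Keep each candidate that does not cross-hybridize with an already kept tag."""
    kept = []
    for tag in candidates:
        if not any(cross_hybridizes(tag, other) for other in kept):
            kept.append(tag)
    return kept


def shares_run(a, b, n=4):
    """Stand-in predicate: True if tags a and b have any n-base run in common."""
    runs = {a[i:i + n] for i in range(len(a) - n + 1)}
    return any(b[i:i + n] in runs for i in range(len(b) - n + 1))


if __name__ == "__main__":
    pool = ["GATTACAG", "CATTACAC", "GTCAGTCA"]
    print(greedy_pairwise_filter(pool, shares_run))
    # The result depends on the order in which the tags are evaluated.
    print(greedy_pairwise_filter(pool[::-1], shares_run))
```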

Refinements of the above methods include accounting for the differences in 
destabilization caused by positional effects of mismatches in the probe:tag duplex. The 
total number of mismatches is not the best estimate of hybridization potential because the 
amount of destabilization is highly dependent on both the positions and the types of the 
mismatches; for example, two adjacent mismatched bases in a 20 nucleotide duplex are 
generally less destabilizing than two mismatches spread at equal intervals. A better 
estimate of cross-hybridization potential can be obtained by comparing the two tags 
directly using dynamic programming or other methods. In these embodiments, a set of 
tag sequences obeying the following rules (in which the presence of a constant region in 
the tags is optional) is generated: 

(A) All tags are the same length, N, and have similar base composition. 
Certain runs of bases and potential hairpin structures are prohibited (see above). 

(B) No two tag sequences contain an identical subsequence of length n, for 
some threshold length n. The second rule allows fast screening of the majority of cross- 
hybridizing probes (the selection method is linear), leaving a short list for which each 
pair of probes is compared for similarity (this takes time proportional to the square of the 
number of probes). The method is essentially an alphabetical tree search with the 
addition of an array to keep track of which n-mers have been used in previously 
generated tags. Each time the addition of a base to the growing tag creates an n-mer 
already used in a previous tag, the method backtracks and tries the next value of the base. 
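
A minimal sketch of the tree search of rule (B) follows. It is an illustration under 
stated assumptions, not the program of the Examples: the function names are 
hypothetical, a Python set stands in for the array of used n-mers, and the other rules 
(base composition, prohibited runs, hairpins) are deliberately omitted, so that tags such 
as AAAA appear in the toy output. 

```python
def nmers(tag, n):
    """Return the set of n-base windows of a tag."""
    return {tag[i:i + n] for i in range(len(tag) - n + 1)}


def generate_tags(length, n, max_tags, alphabet="ACGT"):
    """Alphabetical depth-first search for tags that share no n-mer with each other."""
    used = set()   # n-mers belonging to tags accepted so far
    tags = []

    def extend(prefix):
        # Returns True once a tag has been accepted somewhere below this prefix.
        if len(prefix) == length:
            tags.append(prefix)
            used.update(nmers(prefix, n))
            return True
        for base in alphabet:
            if len(tags) >= max_tags:
                return False
            candidate = prefix + base
            if len(candidate) >= n and candidate[-n:] in used:
                continue  # n-mer already used: try the next value of the base
            if extend(candidate) and len(prefix) >= n:
                # The prefix now contains used n-mers, so abandon it as well.
                return True
        return False

    extend("")
    return tags


if __name__ == "__main__":
    print(generate_tags(length=4, n=2, max_tags=3))
```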

(C) In this step, pairs of tags are compared to each other using a more 
sophisticated hybridization energy rule. For each pair of tags, the energy of 
hybridization of each tag to the complement of the other is calculated. If the energy 
exceeds a certain threshold, one of the tags is removed from the list. Probes are removed 
until there are no pairs within the list exceeding the threshold. For instance, in one 
embodiment, the energy rule is as follows: one point for a match (two adjacent matching 
base pairs), -2 points for errors in which a single base is bulged out on one strand, and -3 
for all other errors including long and asymmetric loops. The highest scoring alignment 
between each pair of tags was determined using a dynamic programming algorithm with 
a preprocessor requiring a short match of at least 5 bases to initiate comparison. 
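
The dynamic programming comparison can be illustrated with a simplified local 
alignment. The sketch below is not the energy rule described above: as a labeled 
simplification it scores +1 for each matching base (rather than for each pair of adjacent 
matching base pairs), -2 for each bulged base, and -3 for each mismatch, and it omits 
the 5-base preprocessor. Two tags are aligned directly, on the assumption that identity 
between tags corresponds to pairing of one tag with the complement of the other. All 
names are hypothetical. 

```python
def best_local_score(a, b, match=1, mismatch=-3, bulge=-2):
    """Highest-scoring local alignment between tags a and b (Smith-Waterman style)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new local alignment
                previous[j - 1] + step,   # pair a[i-1] with b[j-1]
                previous[j] + bulge,      # base of a bulged out
                current[j - 1] + bulge,   # base of b bulged out
            )
            best = max(best, current[j])
        previous = current
    return best


def exceeds_threshold(a, b, threshold):
    """True if the pair scores above the chosen threshold and one tag should go."""
    return best_local_score(a, b) > threshold


if __name__ == "__main__":
    print(best_local_score("GATTACAGATTACA", "GATTACATGATTACA"))
```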

More sophisticated energy rules can be incorporated in this framework by 
refining the hybridization rules. In addition, more complex rules for calculating the 
hybridization energy can be used in any of the above procedures. See, Vesnaver et al. 
(1989) Proc. Natl. Acad. Sci. USA 86, 3614-3618; Wetmur (1991) Critical Reviews in 
Biochemistry and Molecular Biology 26(3/4), 227-259, and Breslauer et al. (1986) Proc. 
Natl. Acad. Sci. USA 83, 3746-3750. 

The set of tags chosen using pairwise comparison is not unique. It depends on 
the order of evaluation of the tags. For instance, if the tags are evaluated in alphabetical 
order, the final list contains more tags beginning with A than with T. A more 
sophisticated approach to generating the largest maximal set of such tags can also be 
used involving de Bruijn sequences (sequences in which every n-mer for some length n 
occurs exactly once). For example, a de Bruijn sequence incorporating all n-mers could 
be divided into 20-mers with overlaps of n-1, giving the maximal number of 20-mers 
which do not have an n-mer in common. This procedure is modified to take into account 
the other goals for tags outlined above. For example, runs of bases are typically removed 
from the original de Bruijn sequence, and 20-mers with imbalanced base composition or 
palindromes (which occur on a length scale greater than n) are deleted in a post- 
processing step. 
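
The de Bruijn approach can be sketched as follows. The generator below is the standard 
recursive construction based on Lyndon words, which is an assumption of this 
illustration since the text does not specify how the sequence is produced; the function 
names are hypothetical, and the post-processing described above is not included. 

```python
def de_bruijn(alphabet, n):
    """Return a cyclic de Bruijn sequence containing every n-mer exactly once."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


def split_into_tags(cyclic, n, length):
    """Cut the sequence into windows of `length` bases overlapping by n-1 bases."""
    linear = cyclic + cyclic[:n - 1]   # unroll the cycle so wrap-around n-mers appear
    step = length - (n - 1)
    return [linear[i:i + length] for i in range(0, len(linear) - length + 1, step)]


def nmers(tag, n):
    """Return the set of n-base windows of a tag."""
    return {tag[i:i + n] for i in range(len(tag) - n + 1)}


if __name__ == "__main__":
    tags = split_into_tags(de_bruijn("ACGT", 3), n=3, length=6)
    print(len(tags), tags[:4])
```

Because consecutive windows overlap by only n-1 bases, no n-mer falls in two windows, 
which is the property relied on in the passage above. 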

Many alternative procedures for tag selection and exclusion are possible. 
For example, all pairwise energies can be computed before discarding any potential tags, 
and the tag which has the most near-matches exceeding the energy threshold can be 
discarded. The remaining tag with the most near-matches (not including those to tags 
which have already been discarded) can be discarded. This process is repeated until there 
are no near-matches remaining. 
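
This alternative can be sketched by first recording every near-match and then 
repeatedly discarding the tag with the most remaining near-matches. The sketch assumes 
the tags are distinct strings; the near-match test is passed in as a function, and the 
stand-in used in the example (an ungapped difference of at most one base) and all names 
are hypothetical. 

```python
def hamming(a, b):
    """Count the positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))


def prune_most_conflicting(tags, near_match):
    """Discard the tag with the most near-matches until no near-matches remain."""
    conflicts = {tag: set() for tag in tags}
    for i, a in enumerate(tags):          # compute all pairwise results first
        for b in tags[i + 1:]:
            if near_match(a, b):
                conflicts[a].add(b)
                conflicts[b].add(a)
    while conflicts:
        worst = max(conflicts, key=lambda tag: len(conflicts[tag]))
        if not conflicts[worst]:
            break                          # no near-matches remain
        for other in conflicts.pop(worst):
            conflicts[other].discard(worst)
    return [tag for tag in tags if tag in conflicts]


if __name__ == "__main__":
    pool = ["AAAA", "AAAT", "AATA", "GGGG"]
    print(prune_most_conflicting(pool, lambda a, b: hamming(a, b) <= 1))
```

In this toy pool, discarding AAAA first retains three tags, whereas evaluating the tags 
in list order and keeping AAAA would retain only two. 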

For example, in one preferred embodiment of the above selection methods, 
the tags lack a constant region. Tags are selected by selecting all possible n-mers (e.g., 
20-mers), and eliminating sequences which: 

(i) have runs of 4x where x is an A, T, C, or G (e.g., AAAA); 

(ii) have secondary structure in which a run of 4 contiguous nucleotides have a 
complementary matching run of 4 nucleotides within the tag; or 

(iii) have a 9 base subsequence (or other selected number of subsequences, 
typically from 5 to 15) in common with any other tag. 

All tags are then selected to have the same GC content, thereby providing all tags with 
similar melting temperatures when bound to a complementary probe. Performing the 
above steps limits the number of tags in the tag set to a pool of about 50,000 potential 
tags where the tags are 20-mers. 

A pair-wise selection strategy is then performed to yield a final set of tags. 
In the pair-wise comparison, a first tag is compared to every other tag in the tag set for 
hybridization to the complement of the first tag. If the first tag binds to its target with a 
hybridization energy that exceeds that of every other tag in the potential set by a selected 
threshold value, it is kept. If another tag in the potential tag set binds to the complement 
of the first tag with a hybridization energy above the selected threshold, the first tag is 
discarded. This process is repeated for every tag remaining in the pool of potential tags. 
An exemplar computer program written in "C" is provided in Example 4 (Tags895.ccp) 
which performs the above selection steps. Varying the threshold resulted in sets of 0 to 
50,000 tags. In one preferred embodiment, 9,000 tags were generated. 

More generally, tags (or probes complementary to the tags) are selected by 
eliminating tags which cross-hybridize (bind to the same nucleic acid with a similar 
energy of hybridization). Tags bind complementary nucleic acids with a similar energy 
of hybridization when a complementary nucleic acid to one tag binds to another tag with 
an energy exceeding a specified threshold value, e.g., where one tag is a perfect match to 
the probe, a second tag is excluded if it binds the same probe with an energy of 
hybridization which is similar to the energy of hybridization of the perfect match probe. 
Typically, if the second tag binds to the probe with about 80-95% or greater of the energy 
of a perfectly complementary tag, or more typically about 90-95% or greater, or most 
typically about 95% or greater, the tag is discarded from the tag set. The calculated 
energy can be based upon the stacking energy of various base pairs, the energy cost for a 
loop in the hybridized probe-tag nucleic acid chain, and/or upon assigned values for 
hybridization of base pairs, or on other specified hybridization parameters. In Example 2 
below, tags were selected by eliminating similar tags from a long list of possible tags 
based upon hybridization properties such as assigned stacking values for tag:probe 
hybrids. 

Methods which do not involve pairwise comparison are also used. In one 
embodiment, the tags were selected to comprise a constant portion and a variable portion. 
The variable portion of the tag sequence was limited to sequences which comprise no 
more than 1 C residue. The constant region of the tag sequence was chosen to be 
3'(ACTC),CC. This selection of specific sequences satisfies goal 7 above (the constant 
region was selected such that it is not self-complementary). One of skill will recognize 
that other constant regions can be selected, for instance where a primer binding site or 
restriction endonuclease site is incorporated into the tags. 

Goal 2 is also met, because the probe (which is complementary to the 
variable portion of the tag) has no contiguous region of hybridizing bases with the 
exception of a single AGT or TGA sequence present on some of the probes, and even 
these sequences are not adjacent to the variable region of the tag where the primary 
hybridization occurs. In order to meet goals (1) and (8), the tags are selected to have the 
same length and the same total G+C content. 

To prevent cross-hybridization between a tag and probes complementary to 
other tags, a set of tags which could not be aligned with fewer than two errors was chosen, 
wherein an "error" is either a mismatch hybridization or an overhanging nucleotide. This 
was done by fixing the sequence of the bases at the ends of the tags. In particular, the 
bases at the ends of the tags were constrained to be the same. The bases were selected to 
start at the 5' end with the residues GA, and to end at the 3' end with either an A or T 
residue, followed by a G residue. This arrangement prevents tags from matching probes 
with a single overhanging nucleotide and no other errors, because an overhang forces 
either a G-A or a G-T mismatch. This arrangement also prevents tags from matching with 
a single deletion error, because a single deletion would cause the probe and the tag to be 
misaligned at one end, causing a mismatch. One of skill will appreciate that this strategy 
can be modified in many ways to yield equivalent results, e.g., by selecting bases to yield 
C-T or C-A mismatches. 

To prevent single mismatch errors in the tag-probe hybridization, the next to
last base from the 5' end was selected so that the number of As plus the number of Gs in
the variable region is even (as noted above, the next to last base from the 5' end is either
a T or an A). This base acts in a manner analogous to a parity bit in coding theory by
requiring that at least two differences exist between any two tags in the tag set. This is true
because the GC content of all of the selected tags is the same (see above); therefore, any
base differences in the variable region have to involve the substitution of G and C
residues, or T and A residues. However, the substitution of less than two bases leads to
an odd number of G+A residues. Thus, at least two bases differ between any two tags in
the tag set, satisfying (3) above. Similarly, the strategy could be varied, e.g., by
selecting a different residue in the tag as the parity base which assigns whether the A+G
content of the tag is even, or by adjusting the strategy above to yield an even number of
T+G residues.
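
The end-base, G+C and parity rules described above can be illustrated with a short sketch in C. This is not the program of Example 3; the function names tag_obeys_rules and parity_base, and the use of a placeholder character at the parity position, are assumptions made only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only (hypothetical names, not the tags.ccp program).
 * Returns 1 if the tag starts 5'-GA, ends [A|T]G-3', has exactly
 * gc_required G+C residues, and has an even number of A+G residues. */
int tag_obeys_rules(const char *tag, int gc_required)
{
    size_t n = strlen(tag);
    size_t i;
    int gc = 0, ag = 0;

    if (n < 4) return 0;
    if (tag[0] != 'G' || tag[1] != 'A') return 0;
    if (tag[n - 2] != 'A' && tag[n - 2] != 'T') return 0;
    if (tag[n - 1] != 'G') return 0;

    for (i = 0; i < n; ++i) {
        switch (tag[i]) {
        case 'G': ++gc; ++ag; break;
        case 'C': ++gc; break;
        case 'A': ++ag; break;
        case 'T': break;
        default: return 0;          /* not a nucleotide letter */
        }
    }
    return gc == gc_required && (ag % 2) == 0;
}

/* Chooses the parity base (A or T) for the next-to-last position so that
 * the A+G total of the finished tag is even. The character currently at
 * that position is ignored, so any placeholder may be used there. */
char parity_base(const char *tag)
{
    size_t n = strlen(tag);
    size_t i;
    int ag = 0;

    if (n < 2) return '\0';
    for (i = 0; i < n; ++i) {
        if (i == n - 2) continue;
        if (tag[i] == 'A' || tag[i] == 'G') ++ag;
    }
    return (ag % 2) ? 'A' : 'T';
}
```

Under these assumptions, a single G-to-C or A-to-T substitution in an accepted tag leaves the G+C total unchanged but makes the A+G total odd, so the altered sequence is rejected.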

A computer program in the standard programming language "C" was
written to perform each of the selection steps. For completeness, the program is provided
as Example 3 below, but it is expected that one of skill can design similar programs, or
perform the selection steps outlined above manually to achieve essentially similar results.
Tags.ccp uses a pruned tree search to find all sequences meeting the goals above, rather
than testing every sequence of a selected length for the desired sequence characteristics.
While this provides an elegant selection program with few processing steps, one of skill
will recognize that other programs can be used which test every potential tag for a desired
sequence. Tags.ccp selects tag sets depending upon a variety of parameters including the
constant sequence, the variable sequence, the GC content, the goal that the A+G total be
even, and the relationship of the constant and variable regions in the tags of the tag set.
For instance, in one experiment, a constant and a variable region were selected. The 
variable sequence length was selected to be 15 nucleotides in length, the G+C content of 
the variable region selected to be 7 nucleotides, and the A+G total number of bases was 
selected to be even, with a pattern of ??N„[AT]?, where ? is a selected fixed base. The 
parameters resulted in a set of about 8,000 tag sequences. 

More generally, the problem of designing a set of tag sequences which do 
not cross-hybridize has strong similarities to the problem of designing error-correcting 
codes in coding theory. The primary difference is that insertions and deletions do not 
have a correlate in coding theory. This problem is accounted for as shown above by
having constant regions within the probe sequences. The strategy is generalized by
changing the location of the parity bit, the required parity, or the locations of the constant
regions. More complex codes are also useful, for example codes which require more
differences between pairs of tags. See Blahut (1983) Theory and Practice of Error
Control Codes, Addison-Wesley Publishing Company, Menlo Park, CA.
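
The coding-theory analogy can be made concrete by measuring the minimum number of differing positions (the Hamming distance) between any two tags of a set. The following C sketch is illustrative only; the function names are hypothetical, and it considers substitutions alone, not the insertions and deletions discussed above.

```c
#include <stddef.h>
#include <string.h>

/* Number of positions at which two equal-length tags differ,
 * or -1 if the lengths are not the same. */
int hamming_distance(const char *a, const char *b)
{
    size_t n = strlen(a);
    size_t i;
    int d = 0;

    if (strlen(b) != n) return -1;
    for (i = 0; i < n; ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

/* Smallest pairwise distance in a tag set, or -1 if the set has fewer
 * than two tags or contains tags of unequal length. */
int min_pairwise_distance(const char *const tags[], size_t count)
{
    size_t i, j;
    int best = -1;

    for (i = 0; i < count; ++i) {
        for (j = i + 1; j < count; ++j) {
            int d = hamming_distance(tags[i], tags[j]);
            if (d < 0) return -1;
            if (best < 0 || d < best) best = d;
        }
    }
    return best;
}
```

A tag set built with the parity rule would be expected to report a minimum distance of at least 2; a more complex code requiring more differences between pairs of tags would raise that figure.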

One of skill will also recognize that pairwise comparison methods are used 
in conjunction with any other selection method. For instance, the tags generated 
according to specified rules such as those implemented by tags.ccp can be further selected 
using any of the pairwise comparison methods described herein. 

Synthesis of Oligonucleotide Arrays

Oligonucleotide arrays are selected to have oligonucleotides complementary 
to the tag nucleic acids described above. The synthesis of oligonucleotide arrays 
generally is known. The development of very large scale immobilized polymer synthesis 
(VLSIPS™) technology provides methods for arranging large numbers of oligonucleotide 
probes in very small arrays. Pirrung et al., U.S. Patent No. 5,143,854 (see also PCT 
Application No. WO 90/15070), McGall et al., U.S. Patent No. 5,412,087, Chee et al.
SN PCT/US94/12305, and Fodor et al., PCT Publication No. WO 92/10092 describe
methods of forming vast arrays of oligonucleotides using, for example, light-directed
synthesis techniques. See also, Fodor et al. (1991) Science 251: 767-777; Lipshutz et al.
(1995) BioTechniques 19(3): 442-447; Fodor et al. (1993) Nature 364: 555-556; and
Medlin (1995) Environmental Health Perspectives 244-246.

As described above, diverse methods of making oligonucleotide arrays are 
known; accordingly no attempt is made to describe or catalogue all known methods. For 
exemplary purposes, light directed VLSIPS™ methods are briefly described below. One
of skill will understand that alternate methods of creating oligonucleotide arrays, such as
spotting and/or flowing reagents over defined regions of a solid substrate, bead based
methods and pin-based methods are also known and applicable to the present invention
(See, e.g., US Pat. No. 5,384,261, incorporated herein by reference for all purposes).
In the methods disclosed in these applications, reagents are typically delivered to the
substrate by flowing or spotting polymer synthesis reagents on predefined regions of the
solid substrate.

Light directed VLSIPS™ methods are found, e.g., in U.S. Patent No. 
5,143,854 and No. 5,412,087. The light directed methods discussed in the '854 patent 
typically proceed by activating predefined regions of a substrate or solid support and then
contacting the substrate with a preselected monomer solution. The predefined regions are
activated with a light source, typically shone through a photolithographic mask. Other
regions of the substrate remain inactive because they are blocked by the mask from
illumination. Thus, a light pattern defines which regions of the substrate react with a
given monomer. By repeatedly activating different sets of predefined regions and
contacting different monomer solutions with the substrate, a diverse array of
oligonucleotides is produced on the substrate. Other steps, such as washing unreacted
monomer solution from the substrate, are used as necessary.

The surface of a solid support is typically modified with linking groups 
having photolabile protecting groups (e.g., NVOC or MeNPoc) and illuminated through a
photolithographic mask, yielding reactive groups (e.g., typically hydroxyl groups) in the
illuminated regions. For instance, during oligonucleotide synthesis, a 3'-O-phosphoramidite
(or other nucleic acid synthesis reagent) activated deoxynucleoside
(protected at the 5'-hydroxyl with a photolabile group) is presented to the surface and
coupling occurs at sites that were exposed to light in the previous step. Following
capping, and oxidation, the substrate is rinsed and the surface illuminated through a
second mask, to expose additional hydroxyl groups for coupling. A second 5'-protected,
3'-O-phosphoramidite activated deoxynucleoside (or other oligonucleotide monomer as
appropriate) is then presented to the resulting array. The selective photodeprotection and
coupling cycles are repeated until the desired set of oligonucleotides is produced.
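
The repeated mask-and-couple cycle can be pictured with a small simulation in C. The sketch below is an assumption-laden illustration, not a description of any actual VLSIPS™ mask design procedure: it presents the monomers in a fixed A, C, G, T rotation, and in each cycle "illuminates" exactly those sites whose next required base equals the monomer being presented. The names next_mask, run_synthesis and MAX_SITES are hypothetical.

```c
#include <string.h>

#define MAX_SITES 32   /* hypothetical limit on the number of array sites */

/* Fills mask[s] with 1 for every site whose next required base is
 * `monomer`; returns the number of illuminated sites. */
int next_mask(const char *const probes[], const int progress[], int n_sites,
              char monomer, unsigned char mask[])
{
    int s, lit = 0;

    for (s = 0; s < n_sites; ++s) {
        mask[s] = (unsigned char)(probes[s][progress[s]] == monomer);
        lit += mask[s];
    }
    return lit;
}

/* Simulates synthesis with monomers presented in the order A, C, G, T,
 * A, ... Returns the number of coupling cycles needed to finish every
 * probe, or -1 if the probes cannot be completed (e.g. invalid letters). */
int run_synthesis(const char *const probes[], int n_sites)
{
    static const char monomers[] = "ACGT";
    int progress[MAX_SITES] = {0};
    unsigned char mask[MAX_SITES];
    int cycles = 0, remaining = 0, max_cycles = 0, s;

    if (n_sites <= 0 || n_sites > MAX_SITES) return -1;
    for (s = 0; s < n_sites; ++s) {
        int len = (int)strlen(probes[s]);
        if (len > 0) ++remaining;
        if (4 * len > max_cycles) max_cycles = 4 * len;  /* one base per rotation at worst */
    }
    while (remaining > 0 && cycles < max_cycles) {
        char monomer = monomers[cycles % 4];
        ++cycles;
        next_mask(probes, progress, n_sites, monomer, mask);
        for (s = 0; s < n_sites; ++s) {
            if (mask[s]) {
                ++progress[s];                       /* coupling at an illuminated site */
                if (probes[s][progress[s]] == '\0') --remaining;
            }
        }
    }
    return remaining == 0 ? cycles : -1;
}
```

For example, under these assumptions the three probes AC, CA and GT are completed in five cycles (A, C, G, T, A), with a different set of sites illuminated in each cycle.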

In addition to VLSIPS™ arrays, other probe arrays can also be made. For 
instance, standard Southern or northern blotting technology can be used to fix nucleic acid 
probes to various substrates such as papers, nitrocellulose, nylon and the like. Because 
the formation of large arrays using standard technologies is difficult, VLSIPS™ arrays are 
preferred. 

Making Tag Nucleic Acids and Oligonucleotides to be Coupled into Arrays; Synthesis of
Tag Nucleic Acids; Cloning of Tag Nucleic Acids into Cells

As described above, several methods for the synthesis of oligonucleotide 
arrays are known. In preferred embodiments, the oligonucleotides are synthesized directly
on a solid surface as described above. However, in certain embodiments, it is useful to
synthesize the oligonucleotides and then couple the oligonucleotides to the solid substrate
to form the desired array. Similarly, nucleic acids in general (e.g., tag nucleic acids) can
be synthesized on a solid substrate and then cleaved from the substrate, or they can be
synthesized in solution (using chemical or enzymatic procedures), or they can be naturally
occurring (i.e., present in a biological sample).

Molecular cloning and expression techniques for making biological and 
synthetic oligonucleotides and nucleic acids are known in the art. A wide variety of 
cloning and expression and in vitro amplification methods suitable for the construction of 
nucleic acids are well-known to persons of skill. Examples of techniques and instructions 
sufficient to direct persons of skill through many cloning exercises for the expression and 
purification of biological nucleic acids (DNA and RNA) are found in Berger and Kimmel, 
Guide to Molecular Cloning Techniques, Methods in Enzymology volume 152 Academic
Press, Inc., San Diego, CA (Berger); Sambrook et al. (1989) Molecular Cloning - A
Laboratory Manual (2nd ed.) Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring
Harbor Press, NY, (Sambrook); and Current Protocols in Molecular Biology, F.M.
Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing
Associates, Inc. and John Wiley & Sons, Inc., (1994 Supplement) (Ausubel). Nucleic
acids such as Tag nucleic acids can be cloned into cells (thereby creating recombinant
tagged cells) using standard cloning protocols such as those described in Berger,
Sambrook and Ausubel.

Examples of techniques sufficient to direct persons of skill through in vitro 
methods of nucleic acid synthesis and amplification of tags and probes in solution,
including enzymatic methods such as the polymerase chain reaction (PCR), the ligase
chain reaction (LCR), Qβ-replicase amplification (QBR), nucleic acid sequence based
amplification (NASBA), strand displacement amplification (SDA), the cycling probe
amplification reaction (CPR), branched DNA (bDNA) and other DNA and RNA
polymerase mediated techniques are known. Examples of these and related techniques are
found in Berger, Sambrook, and Ausubel, as well as Mullis et al., (1987) U.S. Patent
No. 4,683,202; PCR Protocols A Guide to Methods and Applications (Innis et al. eds)
Academic Press Inc. San Diego, CA (1990) (Innis); Arnheim & Levinson (October 1,
1990); WO 94/11383; Vooijs et al. (1993) Am. J. Hum. Genet. 52: 586-597; C&EN 36-47;
The Journal Of NIH Research (1991) 3, 81-94; (Kwoh et al. (1989) Proc. Natl. Acad.
Sci. USA 86, 1173; Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87, 1874; Lomeli et
al. (1989) J. Clin. Chem. 35, 1826; Landegren et al., (1988) Science 241, 1077-1080;
Van Brunt (1990) Biotechnology 8, 291-294; Wu and Wallace, (1989) Gene 4, 560;
Sooknanan and Malek (1995) Bio/Technology 13, 563-564; Walker et al. Proc. Natl.
Acad. Sci. USA 89, 392-396), and Barringer et al. (1990) Gene 89, 117. Improved
methods of cloning in vitro amplified nucleic acids are described in Wallace et al., U.S.
Pat. No. 5,426,039. In one preferred embodiment, nucleic acid tags are amplified prior
to hybridization with VLSIPS™ arrays as described above. For instance, where tag
nucleic acids are cloned into cells in a cellular library, the tags can be amplified using 
PCR. 

Standard solid phase synthesis of nucleic acids is also known. 
Oligonucleotide synthesis is optionally performed on commercially available solid phase 
oligonucleotide synthesis machines (see, Needham-VanDevanter et al. (1984) Nucleic
Acids Res. 12:6159-6168) or manually synthesized using the solid phase phosphoramidite
triester method described by Beaucage et al. (Beaucage et al. (1981) Tetrahedron Letts.
22 (20): 1859-1862). Finally, as described above, nucleic acids are optionally
synthesized using VLSIPS™ methods in arrays, and optionally cleaved from the array.
The nucleic acids can then be optionally reattached to a solid substrate to form a second
array where appropriate, or used as tag nucleic acids where appropriate, or used as tag
sequences for cloning into a cell.

The term "label" refers to a composition detectable by spectroscopic, 
photochemical, biochemical, immunochemical, or chemical means. For example, useful 
nucleic acid labels include 32P, 35S, fluorescent dyes, electron-dense reagents, enzymes 
(e.g., as commonly used in an ELISA), biotin, digoxigenin, or haptens and proteins for
which antisera or monoclonal antibodies are available.

A wide variety of labels suitable for labeling nucleic acids and conjugation 
techniques are known and are reported extensively in both the scientific and patent 
literature, and are generally applicable to the present invention for the labeling of tag 
nucleic acids, or amplified tag nucleic acids for detection by the arrays of the invention.
Suitable labels include radionucleotides, enzymes, substrates, cofactors, inhibitors,
fluorescent moieties, chemiluminescent moieties, magnetic particles, and the like.
Labeling agents optionally include e.g., monoclonal antibodies, polyclonal antibodies,
proteins, or other polymers such as affinity matrices, carbohydrates or lipids. Detection
of tag nucleic acids proceeds by any known method, including immunoblotting, tracking
of radioactive or bioluminescent markers, Southern blotting, northern blotting,
southwestern blotting, northwestern blotting, or other methods which track a molecule
based upon size, charge or affinity. The particular label or detectable group used and the
particular assay are not critical aspects of the invention. The detectable moiety can be
any material having a detectable physical or chemical property. Such detectable labels
have been well-developed in the field of gels, columns, solid substrates and in general,
labels useful in such methods can be applied to the present invention. Thus, a label is
any composition detectable by spectroscopic, photochemical, biochemical,
immunochemical, electrical, optical or chemical means. Useful labels in the present
invention include fluorescent dyes (e.g., fluorescein isothiocyanate, Texas red,
rhodamine, and the like), radiolabels (e.g., 3H, 125I, 35S, 14C, or 32P), enzymes (e.g.,
LacZ, CAT, horse radish peroxidase, alkaline phosphatase and others, commonly used as
detectable enzymes, either as marker gene products or in an ELISA), nucleic acid
intercalators (e.g., ethidium bromide) and colorimetric labels such as colloidal gold or
colored glass or plastic (e.g. polystyrene, polypropylene, latex, etc.) beads.

The label is coupled directly or indirectly to the desired nucleic acid 
according to methods well known in the art. As indicated above, a wide variety of labels 
are used, with the choice of label depending on the sensitivity required, ease of 
conjugation of the compound, stability requirements, available instrumentation, and 
disposal provisions. Non radioactive labels are often attached by indirect means. 
Generally, a ligand molecule (e.g., biotin) is covalently bound to a polymer. The ligand
then binds to an anti-ligand (e.g., streptavidin) molecule which is either inherently
detectable or covalently bound to a signal system, such as a detectable enzyme, a
fluorescent compound, or a chemiluminescent compound. A number of ligands and
anti-ligands can be used. Where a ligand has a natural anti-ligand, for example, biotin,
thyroxine, and cortisol, it can be used in conjunction with labeled anti-ligands.
Alternatively, any haptenic or antigenic compound can be used in combination with an
antibody. Labels can also be conjugated directly to signal generating compounds, e.g.,
by conjugation with an enzyme or fluorophore. Enzymes of interest as labels will
primarily be hydrolases, particularly phosphatases, esterases and glycosidases, or
oxidoreductases, particularly peroxidases. Fluorescent compounds include fluorescein and
its derivatives, rhodamine and its derivatives, dansyl, umbelliferone, etc.
Chemiluminescent compounds include luciferin, and 2,3-dihydrophthalazinediones, e.g.,
luminol. Means of detecting labels are well known to those of skill in the art. Thus, for
example, where the label is a radioactive label, means for detection include a scintillation
counter or photographic film as in autoradiography. Where the label is a fluorescent
label, it may be detected by exciting the fluorochrome with the appropriate wavelength of
light and detecting the resulting fluorescence, e.g., by microscopy, visual inspection, via
photographic film, by the use of electronic detectors such as charge coupled devices
(CCDs) or photomultipliers and the like. For detection in VLSIPS™ arrays, fluorescent
labels and detection techniques, particularly microscopy are preferred. Similarly,
enzymatic labels may be detected by providing appropriate substrates for the enzyme and
detecting the resulting reaction product. Finally, simple colorimetric labels are often
detected simply by observing the color associated with the label. Thus, in various
dipstick assays, conjugated gold often appears pink, while various conjugated beads
appear the color of the bead.

Substrates

As mentioned above, depending upon the assay, the tag nucleic acids, or
probes complementary to tag nucleic acids can be bound to a solid surface. Many
methods for immobilizing nucleic acids to a variety of solid surfaces are known in the art.
For instance, the solid surface is optionally paper, or a membrane (e.g., nitrocellulose), a
microtiter dish (e.g., PVC, polypropylene, or polystyrene), a test tube (glass or plastic), a
dipstick (e.g., glass, PVC, polypropylene, polystyrene, latex, and the like), a
microcentrifuge tube, or a glass, silica, plastic, metallic or polymer bead or other
substrate as described herein. The desired component may be covalently bound, or
noncovalently attached to the substrate through nonspecific bonding.

A wide variety of organic and inorganic polymers, both natural and 
synthetic may be employed as the material for the solid surface. Illustrative polymers 
include polyethylene, polypropylene, poly(4-methylbutene), polystyrene, 
polymethacrylate, poly(ethylene terephthalate), rayon, nylon, poly(vinyl butyrate), 
polyvinylidene difluoride (PVDF), silicones, polyformaldehyde, cellulose, cellulose
acetate, nitrocellulose, and the like. Other materials which are appropriate depending on
the assay include paper, glasses, ceramics, metals, metalloids, semiconductive materials,
cements and the like. In addition, substances that form gels, such as proteins (e.g.,
gelatins), lipopolysaccharides, silicates, agarose and polyacrylamides can be used.
Polymers which form several aqueous phases, such as dextrans, polyalkylene glycols or
surfactants, such as phospholipids, long chain (12-24 carbon atoms) alkyl ammonium salts
and the like are also suitable. Where the solid surface is porous, various pore sizes may
be employed depending upon the nature of the system. 

In preparing the surface, a plurality of different materials are optionally 
employed, e.g., as laminates, to obtain various properties. For example, protein 
coatings, such as gelatin can be used to avoid non specific binding, simplify covalent
conjugation, enhance signal detection or the like. If covalent bonding between a
compound and the surface is desired, the surface will usually be polyfunctional or be
capable of being polyfunctionalized. Functional groups which may be present on the
surface and used for linking [...] groups, ethylenic groups, [...]
[...] various methods for noncovalently binding an assay component [...]



EXAMPLES 

The following examples are provided by way of illustration only and not by
way of limitation. Those of skill will readily recognize a variety of noncritical
parameters which could be changed or modified to yield essentially similar results.

The complete sequence of the S. cerevisiae genome is known. In the [...]
of sequencing the genome, thousands of open reading frames, representing [...]
Gene disruption is a powerful tool for determining the function of unknown
ORFs in yeast. Given the sequence of an ORF, it is possible to generate a deletion [...]
using standard gene disruption techniques. The deletion strain is then grown under a [...]
missing ORF. However, individual analysis of thousands of deletion strains to assess a
large number of selective conditions is impractical.

To overcome this problem, individual ORF deletions were tagged with a
distinguishing molecular tag. The deletion specific tags were read by hybridization to a
high density array of oligonucleotide probes comprising probe sets complementary to the
[...] approach for generating
tagged deletion strains that can be pooled and analyzed in parallel through selective
growth assays.

Individual deletion strains were generated using a PCR-targeting strategy
(Baudin, Ozier-Kalogeropoulos et al. 1993, Nucleic Acids Res. 21(14): 3329-3330). ORF
specific molecular tags were incorporated during the transformation (Figure 2). Tagged
deletion strains were pooled and representative aliquots grown under different selective
[...] array was then washed and scanned [...] a highly sensitive confocal [...]
normalized signal for each tag reflects the relative abundance of the different
strains in the pool. The fitness of the deletion strains in the pool [...]
comparing the hybridization patterns [...]

To test the feasibility of the molecular tagging strategy a list of 9[...]
Table 1

Columns: All Possible 20mers; Primary Filter; Secondary Filter

Selection criteria: None; No hairpins > 4 bp; No single nucleotide runs > 4 bases; Similar base composition (4-6 of each base); No common 9mers; Low stringency pair-wise analysis; Highest stringency comparison

Properties of selected 20mers: Similar Tm (+/- [...]°C); Good hybridization; No extreme homology; Unique; Very unique

Number of 20mers accepted: 1.2 x 10^12; 51,081; 51,081; 9,105 -> (4,500 selected at random); 2,643; 853

An array comprising probes complementary to the tag
sequences was produced by standard light-directed VLSIPS™ methods. [...]
array of probes provides probe sets at known locations in the array [...]

Hybridization experiments with 120 different fluorescently labeled 20mer oligonucleotides
showed that the arrays are sensitive, quantitative, and highly specific.
specified threshold value. The [...] stacking energy
of various base pairs, [...] assigned
values for hybridization of base pairs [...]
based upon hybridization properties [...] assigned stacking values for tag:probe
hybrids.

[...] Tags are [...]
with an energy exceeding 30 [...] three distinct
ideas in selecting tags. [...]

3) a hash table to quickly find perfect match segments.
A stacking and loop cost model for the calculation of hybridization energy

The calculated energy is based on the stacking energy of various base
pairs, and the energy cost for a loop in the chain. [...] the user can specify [...]

For example, if the following match occurs, there is a loop size of 1 on the first strand
and 0 on the second strand. Examination of the table reveals a loop penalty of 5 and a
resulting stacking energy of 14.

C 

target: A G G T A C G 
tag: T C C A T G C 
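
A stacking-plus-loop energy model of this general kind can be sketched in C as follows. The scoring values here are hypothetical and are not the values of the table referred to above: each stack of two adjacent paired bases is assumed to score the sum of a per-base strength (1 for A or T, 2 for G or C), and a loop with i unpaired bases on one strand and j on the other is assumed to cost 2(i+j)+1. The names hybrid_energy, base_strength and loop_penalty are likewise illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-base contribution to a stack; 0 marks an invalid base. */
static int base_strength(char b)
{
    switch (b) {
    case 'A': case 'T': return 1;
    case 'G': case 'C': return 2;
    default: return 0;
    }
}

/* Hypothetical penalty for a loop of i unpaired bases on one strand
 * and j unpaired bases on the other. */
static int loop_penalty(int i, int j)
{
    return (i + j == 0) ? 0 : 2 * (i + j) + 1;
}

/* paired: the target bases that are base paired, in order.
 * loop_a[k], loop_b[k]: unpaired bases on each strand between paired[k]
 * and paired[k+1]. Writes the total to *energy and returns 1, or
 * returns 0 if the input is not usable. */
int hybrid_energy(const char *paired, const int loop_a[], const int loop_b[],
                  int *energy)
{
    size_t n = strlen(paired);
    size_t k;
    int total = 0;

    if (n < 2) return 0;
    for (k = 0; k < n; ++k)
        if (base_strength(paired[k]) == 0) return 0;
    for (k = 0; k + 1 < n; ++k) {
        if (loop_a[k] < 0 || loop_b[k] < 0) return 0;
        total += base_strength(paired[k]) + base_strength(paired[k + 1]);
        total -= loop_penalty(loop_a[k], loop_b[k]);
    }
    *energy = total;
    return 1;
}
```

With these assumed values, a bulge of one base on one strand lowers the energy of an otherwise perfect duplex by the loop penalty for sizes (1, 0), while the stack spanning the bulge is still counted.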

The Algorithms to Quickly Calculate the Energy of Hybridization

The hybridization energy can be calculated using either a recursive
algorithm, or a dynamic programming algorithm. The recursive algorithm is fast if the
energy cost for loops is large relative to the stacking energy. The dynamic programming
algorithm is fast if the energy cost for loops is small relative to the stacking energy.



[Figure: two tags, AAACAATTCTA ... ACAGG and CTAGAAGATAAGTACC, with the perfect match region marked.]
Both energy calculation algorithms start out with 2 tags which have a
multiple base perfect match sequence. They then calculate the energy of the perfect 
match sequence, then find the matches that lead to the highest energy before the perfect 
match region, and after the perfect match region. The total matching energy is the sum 
of these three energies. Since this model does not have an implied direction, the same 
algorithm can be used for both the before match energy, and the after match energy, by 
reversing the orders of the before match fragments. 
A recursive, heavily pruned algorithm 

The recursive algorithm tries all match trees, looking at all loop sizes that 
have a low enough cost that they can be paid for with the maximum possible energy for 
the remaining matches. Code for the algorithm is as follows:

float tempEnergy, bestEnergy = 0;
linkSet *en, *returnValue = NULL;
short maxTagLoop;

needed to pay for (h^ 'loopTil'"" " " """^ '"^ precalcuUCed *e minimum number of base pai,. 

i^^H //iis.oop^i:.oagtrtr^^^^^ > bo.herWithLoop[i]U]; 

the target PP + i + U & Urgetftp + j + !]))( // this tests that the tag matches 

Urget. basesLeftlnTag 1 baseslStrnT^gS^to^^^^^^ ^' 

is better than our previous best.... " " *^ ^"^"^ ^"^^^ P'^'' *e loop, and 

(sUcleingEnergylUrgetUplltUrge. n;^l^'iEL^Um^T::!iZ,y )( 

returnValue = makeNewLinkset( returnValue );
returnValue->energy = bestEnergy = tempEnergy;
if( i == 0 && j == 0 ){ // add self and successor as link
    firstLink( returnValue, 2, pp, tp );
}else{ // add successor as link
    firstLink( returnValue, 1, pp+i+1, tp+j+1 );
    addNewLink( returnValue, 1, pp, tp );

[...] // new matching base pair is adjacent to old one
// stacking energy pays for the loop, and is better than our previous best....

en->energy = bestEnergy = tempEnergy;
makeFirstLinkOneLonger( en );
if( returnValue != NULL ) farfree( returnValue );
returnValue = en;
}else // new energy too small, don't want it, just free it
if( en != NULL ) farfree( en );
}else{ // loop size is non-zero

// stacking energy pays for the loop, and is better than our previous best....
if( (tempEnergy = en->energy +
stackingEnergy[target[tp]][target[tp+j+1]] - loopEnergy[i][j]) > bestEnergy){

en->energy = bestEnergy = tempEnergy;
if( returnValue != NULL ) farfree( returnValue );
returnValue = en;
addNewLink( returnValue, 1, pp, tp );
}else // new energy too small, don't want it, just free it
if( en != NULL ) farfree( en );

} 

} 

} 

return returnValue;

} 

A dynamic programming algorithm 

The dynamic programming algorithm starts out by making a matrix of 
permitted or "legal" connections between the two fragments. 



[Figure: the two tags, CTAGAAGATAAGTACC and AAACAATTCTAACAGG, with the perfect match region marked.]
Then, starting at the upper left corner, each legal connection is considered, and all
previous bases are identified. Previous bases are any legal base pair in the rectangle to the
upper left of the considered base. In the figure below, the first three legal matches have
no previous legal connection except the (assumed) perfect match which occurs before the
mismatch segment.
The values in those cells are replaced by the sum of the stacking energy, and the loop
cost.
Then the process is repeated for each legal connection further out in the matrix. If there
is more than one legal loop, the loop with the maximum value is used. When this
process is finished, the path that resulted in the maximum cell value is the best match.
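
The matrix procedure can be sketched in C as below. This is an illustration under stated assumptions rather than the program used in the example: a "legal connection" is taken to be a position where the two fragments carry the same letter (both fragments written in the same sense), the scoring functions are hypothetical (stack score equal to the sum of per-base strengths, loop cost 2(i+j)+1), and the function name best_extension_energy is invented. The anchor argument stands for the last base of the assumed perfect match that precedes the mismatch segment.

```c
#include <string.h>

#define MAX_FRAG 32   /* hypothetical limit on fragment length */

static int strength(char b) { return (b == 'G' || b == 'C') ? 2 : 1; }
static int stack_energy(char x, char y) { return strength(x) + strength(y); }
static int loop_cost(int i, int j) { return (i + j == 0) ? 0 : 2 * (i + j) + 1; }

/* Returns the best energy obtainable by extending the perfect match into
 * fragments a and b, 0 if no extension pays for itself, or -1 if a
 * fragment is longer than MAX_FRAG. */
int best_extension_energy(char anchor, const char *a, const char *b)
{
    int cell[MAX_FRAG][MAX_FRAG];
    int m = (int)strlen(a), n = (int)strlen(b);
    int i, j, p, q, best = 0;

    if (m > MAX_FRAG || n > MAX_FRAG) return -1;

    for (i = 0; i < m; ++i) {
        for (j = 0; j < n; ++j) {
            int value;
            if (a[i] != b[j]) continue;            /* not a legal connection */
            /* connect directly to the perfect match region */
            value = stack_energy(anchor, a[i]) - loop_cost(i, j);
            /* or to any earlier legal connection in the upper-left rectangle */
            for (p = 0; p < i; ++p) {
                for (q = 0; q < j; ++q) {
                    if (a[p] == b[q]) {
                        int v = cell[p][q] + stack_energy(a[p], a[i])
                                - loop_cost(i - p - 1, j - q - 1);
                        if (v > value) value = v;
                    }
                }
            }
            cell[i][j] = value;
            if (value > best) best = value;
        }
    }
    return best;
}
```

Only cells that correspond to legal connections are ever written or read, which mirrors the description above in which illegal positions play no part in the path.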
A hash table to quickly find perfect match segments

To speed up the comparison of one tag with all other tags in the list, the 
program uses a hash table which points to all occurrences of any given n-mer in the set of 
tags. The hash table is implemented as two arrays of structures which point to locations 
in tags. The first array is 4 to the n records long, and the second is the size of the 
complete list of tags. 
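
A two-array index of this kind might be sketched in C as follows. The structure and function names (NmerIndex, index_build, index_count), the choice of n = 4, and the record limit are assumptions for illustration. The first array has one entry per possible n-mer and points to a record; each record in the second array points to the next record holding the same n-mer, or holds -1 when there are no more.

```c
#include <string.h>

#define NMER 4                 /* n-mer length (hypothetical choice) */
#define TABLE_SIZE 256         /* 4 to the NMER */
#define MAX_RECORDS 1024       /* hypothetical capacity of the second array */

typedef struct {
    int head[TABLE_SIZE];      /* first record for each n-mer, or -1 */
    int next[MAX_RECORDS];     /* next record with the same n-mer, or -1 */
    int tag_of[MAX_RECORDS];   /* which tag the record lies in */
    int offset_of[MAX_RECORDS];/* where in that tag the n-mer starts */
    int count;                 /* number of records in use */
} NmerIndex;

/* Encodes NMER bases as a base-4 number, or returns -1 on an invalid
 * or missing base. */
static int nmer_code(const char *s)
{
    int code = 0, i;

    for (i = 0; i < NMER; ++i) {
        int b;
        switch (s[i]) {
        case 'A': b = 0; break;
        case 'C': b = 1; break;
        case 'G': b = 2; break;
        case 'T': b = 3; break;
        default: return -1;
        }
        code = code * 4 + b;
    }
    return code;
}

/* Indexes every n-mer of every tag. Returns the number of records,
 * or -1 on an invalid base or when MAX_RECORDS would be exceeded. */
int index_build(NmerIndex *ix, const char *const tags[], int n_tags)
{
    int t, i;

    for (i = 0; i < TABLE_SIZE; ++i) ix->head[i] = -1;
    ix->count = 0;
    for (t = 0; t < n_tags; ++t) {
        int len = (int)strlen(tags[t]);
        for (i = 0; i + NMER <= len; ++i) {
            int code = nmer_code(tags[t] + i);
            int r = ix->count;
            if (code < 0 || r >= MAX_RECORDS) return -1;
            ix->tag_of[r] = t;
            ix->offset_of[r] = i;
            ix->next[r] = ix->head[code];   /* chain to earlier occurrences */
            ix->head[code] = r;
            ix->count = r + 1;
        }
    }
    return ix->count;
}

/* Counts the occurrences of one n-mer by walking its chain. */
int index_count(const NmerIndex *ix, const char *nmer)
{
    int code = nmer_code(nmer);
    int r, n = 0;

    if (code < 0) return 0;
    for (r = ix->head[code]; r != -1; r = ix->next[r]) ++n;
    return n;
}
```

In this sketch new records are linked at the front of each chain, so occurrences are visited most recent first; the order of traversal does not affect which perfect match segments are found.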



[Figure: hash table of n-mers. The first array is 4 to the n records long (aaaaa ... ttttt). The second array holds one record for every n-mer in the probe set. Each arrow points to the next instance of the same n-mer; if there are no more, it points to "NULL".]
Thus, a file is created with a list of all of the tags (or probes) from which
tags are to be selected. Typically, the tags are listed one per line, in a column,
e.g., with the heading "probe" or "tag." The above analysis is performed on the file,
and an output file is produced based on the above method resulting in a list of tags in 
which no tag hybridizes to the complement of any other tag with a stacking energy which 
exceeds a specified threshold. 
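
The overall filtering step can be sketched in C as a greedy pass over the candidate list. This is a simplified illustration, not the program used in the example: the pairwise scoring function is supplied by the caller, and the stand-in used here, longest_shared_run, simply returns the length of the longest run of identical bases shared by two tags (a tag that shares a long run with another tag would hybridize to that tag's complement). The names select_tags, EnergyFn and longest_shared_run are hypothetical.

```c
#include <stddef.h>

/* Pairwise score between two tags; higher means stronger cross-hybridization. */
typedef int (*EnergyFn)(const char *a, const char *b);

/* Stand-in score: length of the longest identical run shared by a and b. */
int longest_shared_run(const char *a, const char *b)
{
    int best = 0;
    size_t i, j;

    for (i = 0; a[i] != '\0'; ++i) {
        for (j = 0; b[j] != '\0'; ++j) {
            int k = 0;
            while (a[i + k] != '\0' && b[j + k] != '\0' && a[i + k] == b[j + k])
                ++k;
            if (k > best) best = k;
        }
    }
    return best;
}

/* Keeps a candidate only if its score against every tag already kept
 * does not exceed threshold. Writes the indices of kept candidates to
 * accepted[] (which must hold n entries) and returns how many were kept. */
int select_tags(const char *const candidates[], int n, EnergyFn energy,
                int threshold, int accepted[])
{
    int kept = 0, c, k;

    for (c = 0; c < n; ++c) {
        int ok = 1;
        for (k = 0; k < kept && ok; ++k)
            if (energy(candidates[c], candidates[accepted[k]]) > threshold)
                ok = 0;
        if (ok) accepted[kept++] = c;
    }
    return kept;
}
```

A stacking energy calculation could be substituted for the stand-in score without changing the selection loop, since the loop depends only on the comparison against the threshold.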

The computer program tags.ccp, written in "C" and referred to above is
provided below: 

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <alloc.h>

char out[] = "TAGS.OUT";
char hout[] = "HIST.OUT";
char label[] = "ACGT";

#define BASES 4 //number of nucleotides
//the following numbers include all the nonperiodic bases at the
//end of the primer (in this case one C) in addition to the tag
#define PRETAG 0 //bases of primer preceding tag
#define LENGTH 20 //length of tag plus pretag
#define GCLIM 11 //max total of G's and C's allowed
#define ATLIM 11 //max total of A's and T's allowed
#define AALIM 7 //max total of A's allowed
#define CCLIM 6 //max total of C's allowed
#define GGLIM 6 //max total of G's allowed
#define TTLIM 7 //max total of T's allowed
#define NUMERS 256 //number of fourmers

J??"" ™ w ^'^^"8^ "^'^^^^ prohibited 

#define LNUM 1048576L //number of longmers
#define MXNUM ( 32768 / sizeof(int))

short numVectors;
int far ** lngVectors;

int test_base(int current_base);
int remove_base_data(int current_base);
int complement(int fourmer);
double seq_to_int(void);
int print_sequence(void);

int pattern[LENGTH][BASES]; //identifies allowed bases each position
int fourmers[NUMERS]; //identifies prohibited 4mers
int sequence[LENGTH]; //base at specified position
int fourseq[LENGTH]; //fourmers ending at specified position
long longseq[LENGTH]; //longmers ending at specified position
int basecnt[BASES]; //number of occurrences of each base to left of
                    //current position
int spacings[LENGTH]; //position past (or =) to which an occurrence of
                      //a complementary fourmer to the one at this
                      //position is prohibited
FILE *outfile, *houtfile;



main()
{
    int i,j,k,cbs;      //counters
    long temp,c_b;      //temp storage of longmer
    int cseq[LENGTH] = {2,0,0,0,1,0,0,0,2,0,2,2,2,3,2}; //sample tag
    long xpas;          //number of passes thru loop over 4
    long maxtag;        //max number of different tags
    int current_base;   //current sequence position within backtracking
    int tag_count;      //count of acceptable tag sequences
    int pass;           //flag for whether tag was acceptable
    int exflag;         //flag indicating completion of all possible tags
    double cur_seq;     //integer representation of current tag sequence
    double prev_seq;    //integer representation of previously tested tag
    int histogram[LENGTH]; //histogram of mismatches to cseq
    int mismatches;

    numVectors = (LNUM/MXNUM);
    lngVectors = (int * *) farcalloc(numVectors, sizeof(int far *));
    if(lngVectors == NULL) {
        printf("\nerror allocating lngVectors!");
        exit(1);
    }
    for(i=0; i<numVectors; ++i) {
        lngVectors[i] = (int *) farcalloc(MXNUM, sizeof(int));
        if(lngVectors[i] == NULL) {
            printf("\nerror allocating vectors: i = %d\n",i);
            exit(1);
        }
    }
    //open output file
    outfile = fopen(out,"wt");
    if(NULL == outfile){
        printf("\nerror opening output file %s\n",out);
        exit(1);
    }
    //open houtput file
    houtfile = fopen(hout,"wt");
    if(NULL == houtfile) {
        printf("\nerror opening output file %s\n",hout);
        fclose(outfile);exit(1);
    }
    //initialize histogram
    for(i=0; i<LENGTH; ++i) histogram[i] = 0;
    //initialize pattern array
    for(i=0; i<LENGTH; ++i) for(k=0; k<BASES; ++k) pattern[i][k] = 1;
    //initialize 4mer array
    for(i=0; i<NUMERS; ++i) fourmers[i] = 1000;
    //initialize 9mer array
    for(i=0; i<LMAT; ++i) lngVectors[i/MXNUM][i%MXNUM] = 0;
    //Runs marked
    fourmers[0] = fourmers[85] = fourmers[170] = fourmers[255] = -1;
    //initialize sequence
    for(i=0; i<LENGTH; ++i) sequence[i] = 0;
    //initialize spacings to disregard incomplete words at beginning
    for(i=0; i<3; ++i) spacings[i] = 1000;
    //initialize spacings to require 6mer palindrome
    for(i=3; i<LENGTH; ++i) spacings[i] = i+2;
    //initialize base sequence, current position, base count, tag count etc.
    //start at "largest" sequence we know is wrong
    sequence[0] = -1;
    current_base = 0;
    basecnt[0] = basecnt[1] = basecnt[2] = basecnt[3] = 0;
    tag_count = 0;
    prev_seq = -1;
    //initialize fourseq for fourmers up to current_base
    //don't need to since it is at zero and it will be reset
    //initialize fourmers for fourmers up to current_base
    //don't need to do anything because only one is at zero and
    // it will be removed shortly anyway
    //calculate maximum number of different tags
    for(maxtag = 1,i=0; i<LENGTH-PRETAG-1; ++i) {
        maxtag *= BASES;
    }
    printf("\nmaxtag = %ld\n",maxtag);
    //THIS IS A DEBUG STATEMENT!! REMOVE LATER
    maxtag = (long) 10000000000;
    pass = 0;

    //until all probes are exhausted
    for(j=0, xpas=0,exflag=0; exflag != 1 && xpas<maxtag; ++j) {
        /* if(j==4){ */
        ++xpas;
        j=0;
        /* } */
        //DEBUG
        //fprintf(outfile,"\nNOW: ");
        //print_sequence();
        //backtrack to last incrementable base
        for(; sequence[current_base] == BASES-1 || (pass == 1 && current_base > 7);
            --current_base) {
            //remove current base data
            remove_base_data(current_base);
            //set base to zero
            sequence[current_base] = 0;
        }
        if(current_base < PRETAG) {
            exflag = 1;
            continue;
        }
        //remove current base data
        remove_base_data(current_base);
        //increment current_base
        ++sequence[current_base];
        //error checking to ensure sequence is increasing
        //DEBUG
        //fprintf(outfile,"\nUPD: ");
        //print_sequence();
        /*
        cur_seq = seq_to_int();
        if(cur_seq <= prev_seq) {
            print_sequence();
            printf("\n\n!!ERRROR: current_base = %d, cur_seq = %d, prev_seq = %d j = %d\n",current_base,cur_seq,prev_seq,j);
            fclose(outfile);exit(1);
        }
        prev_seq = cur_seq;
        */
        //update base data and test until reach end or failure
        for(pass = 1; current_base < LENGTH && pass == 1; ++current_base)
            pass = test_base(current_base);
        //if testing was successful, print sequence
        --current_base;
        //extra check to be sure not repeating longmers
        if(pass == 1) {
            //record prohibitions on 9mers
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                if(lngVectors[c_b/MXNUM][c_b%MXNUM] != 0) {
                    printf("\n\n!!ERROR: already matching 9mer found: c_b = %ld, current_base = %d, longseq = %ld",
                        c_b, cbs, longseq[cbs]);
                    fclose(outfile);exit(1);
                }
            }
            //record longmer prohibitions
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                lngVectors[c_b/MXNUM][c_b%MXNUM] = 1;
            }
            //increment tag count
            ++tag_count;
            //print tag sequence
            fprintf(outfile,"\n");
            print_sequence();
            /*
            //calculate mismatches with ref sequence, cseq
            mismatches = 0;
            for(i=0; i<LENGTH; ++i) if(cseq[i] != sequence[i]) ++mismatches;
            ++histogram[mismatches];
            if(mismatches <= 2) {
                fprintf(houtfile,"\n");
                for(i=PRETAG; i<LENGTH; ++i)
                    fprintf(houtfile,"%c",label[sequence[i]]);
            }
            */
            //fprintf(outfile," OK!");
        }
    }
    //if exited on j >=MAX record error and exit
    if(i>=maxtag)
        printf("\n!!ERROR: exceeded allowable passes through primary loop\n");
    //print tag count
    fprintf(outfile,"\n\nTag Count: %d\n",tag_count);
    for(i=0; i<LENGTH; ++i) fprintf(houtfile,"\n%d mismatches: %d",i,histogram[i]);
    fclose(outfile);
    fclose(houtfile);
    return(1);
}

int test_base(int current_base)
{
    int comp;   //complement
    long c_b;

    //increment base count (must be accomplished before exiting this function)
    ++basecnt[sequence[current_base]];
    //fprintf(outfile,"\nbase = %d ",current_base);
    //test base count, return upon failure
    if(basecnt[0] + basecnt[3] > ATLIM) {
        //fprintf(outfile," failed AT limit");
        return(0);
    }
    if(basecnt[1] + basecnt[2] > GCLIM) {
        //fprintf(outfile,"failed GC limit");
        return(0);
    }
    if(basecnt[0] > AALIM) {
        //fprintf(outfile,"failed A limit");
        return(0);
    }
    if(basecnt[1] > CCLIM) {
        //fprintf(outfile, "failed C limit");
        return(0);
    }
    if(basecnt[2] > GGLIM) {
        //fprintf(outfile, "failed G limit");
        return(0);
    }
    if(basecnt[3] > TTLIM) {
        //fprintf(outfile," failed T limit");
        return(0);
    }
    /*
    //test if base matches patterns, return upon failure
    if(pattern[current_base][sequence[current_base]] != 1) {
        //fprintf(outfile,"failed pattern match");
        return(0);
    }
    //if at last base verify checksum
    if(current_base == LENGTH-1)
        if(0 == ((basecnt[0]+basecnt[2])%2)) {
            //fprintf(outfile, "failed checksum");
            return(0);
        }
    */
    //compute current 4mer
    if(current_base == 0) fourseq[0] = sequence[current_base];
    else fourseq[current_base] = (4*fourseq[current_base-1])%256 + sequence[current_base];
    //if this is a full 4mer check 4mer and return upon failure
    if(current_base > 2) {
        comp = complement(fourseq[current_base]);
        if(fourmers[comp] <= current_base) {
            //fprintf(outfile,"fourmer failed:curt: %d, comp: %d, spacing: %d",
            // fourseq[current_base],comp,fourmers[comp]);
            return(0);
        }
    }
    //compute current 9mer if full 9mer
    if(current_base == 0) longseq[0] = sequence[current_base];
    else longseq[current_base] = (4*longseq[current_base-1])%LNUM + sequence[current_base];
    //if full longmer check 9mer and return upon failure
    if(current_base > LMAT-2) {
        c_b = longseq[current_base];
        if(lngVectors[c_b/MXNUM][c_b%MXNUM] == 1) {
            //printf("hello");
            //fprintf(outfile, "longmer failed:c_b: %ld, lngVectors[c_b]: %d",c_b,lngVectors[c_b]);
            return(0);
        }
    }
    //record prohibitions on 4mer
    if(current_base > 2 && fourmers[fourseq[current_base]] > spacings[current_base])
        fourmers[fourseq[current_base]] = spacings[current_base];
    //all tests passed!! Add new base:
    //fprintf(outfile, "passed! ");
    //passed all tests
    return(1);
}

int remove_base_data(int current_base)
{
    int i;
    int curt;   //current fourmer

    if(sequence[current_base] < 0) return(1);
    //calculate complement
    curt = fourseq[current_base];
    //remove current 4mer prohibitions
    if(fourmers[curt] > -1) fourmers[curt] = 1000;
    //reinstate prohibitions for previous copies of this 4mer
    for(i=0; i<current_base; ++i) {
        if(fourseq[i] == curt) {
            if(fourmers[curt] > spacings[i]) fourmers[curt] = spacings[i];
        }
    }
    //adjust base count
    --basecnt[sequence[current_base]];
    return(1);
}

int complement(int fourmer)
{
    int i;
    int comp;

    //assuming fourmers in four bases
    for(i=0, comp=0; i<4; ++i) {
        comp *= 4;
        comp += 3-(fourmer%4);  //add complement of base
        fourmer /= 4;
    }
    return(comp);
}

double seq_to_int()
{
    int i;
    double seq_int = 0;

    //assumes 4 bases
    for(i=PRETAG; i<LENGTH; ++i) {
        seq_int *= 4;
        seq_int += sequence[i];
        //printf("\n%ld",seq_int);
    }
    return(seq_int);
}

int print_sequence()
{
    int i;

    for(i=PRETAG; i<LENGTH; ++i) fprintf(outfile,"%c",label[sequence[i]]);
    return(1);
}
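
The listing above stores each run of four bases as a base-4 integer (A=0, C=1, G=2, T=3, following the `label` string) and uses `complement()` to find the 4-mer that would pair with it. The following portable C sketch isolates that arithmetic so it can be checked by hand. The helper names `base_code`, `encode4` and `revcomp4` are hypothetical and do not appear in the listing; `revcomp4` repeats the arithmetic of `complement()`.

```c
#include <assert.h>
#include <string.h>

/* Map a base letter to its numeric code (A=0, C=1, G=2, T=3); -1 if invalid. */
int base_code(char b) {
    const char *label = "ACGT";
    const char *p = (b != '\0') ? strchr(label, b) : NULL;
    return p ? (int)(p - label) : -1;
}

/* Encode a 4-letter string as a base-4 integer, first letter most significant,
   which is the value fourseq[] holds for a 4-mer ending at a given position. */
int encode4(const char *s) {
    int v = 0;
    assert(strlen(s) == 4);
    for (int i = 0; i < 4; ++i) {
        int c = base_code(s[i]);
        assert(c >= 0);
        v = 4 * v + c;
    }
    return v;
}

/* Reverse complement of an encoded 4-mer: the last base is read first,
   complemented (3 - code), and becomes the most significant digit. */
int revcomp4(int fourmer) {
    int comp = 0;
    for (int i = 0; i < 4; ++i) {
        comp = 4 * comp + (3 - fourmer % 4);
        fourmer /= 4;
    }
    return comp;
}
```

With this encoding the four single-base runs AAAA, CCCC, GGGG and TTTT take the values 0, 85, 170 and 255, which are the entries the listing sets to -1 under the comment "Runs marked".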



Example 4: Tags895.ccp 

A preferred method of selecting probe arrays involves discarding all probes 
from a pool of probes which have identical 9mer runs (thereby eliminating many tags 
which will cross-hybridize), followed by pair-wise comparison of the remaining tag 
nucleic acids and elimination of tags which hybridize to the same target. An exemplar 
program (Tags895.ccp) written in "C" is provided below. 

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <alloc.h>

char out[] = "TAGS.OUT";
char hout[] = "HIST.OUT";
char label[] = "ACGT";

#define BASES 4        //number of nucleotides
//the following numbers include all the nonperiodic bases at the
//end of the primer (in this case one C) in addition to the tag
#define PRETAG 0       //bases of primer preceding tag
#define LENGTH 20      //length of tag plus pretag
#define GCLIM 11       //max total of G's and C's allowed
#define ATLIM 11       //max total of A's and T's allowed
#define AALIM 7        //max total of A's allowed
#define CCLIM 6        //max total of C's allowed
#define GGLIM 6        //max total of G's allowed
#define TTLIM 7        //max total of T's allowed
#define NUMERS 256     //number of fourmers
#define LMAT 10        //length of long matches prohibited
#define LNUM 1048576L  //number of longmers
#define MXNUM ( 32768 / sizeof(int))

short numVectors;
int far ** lngVectors;

int test_base(int current_base);
int remove_base_data(int current_base);
int complement(int fourmer);
double seq_to_int(void);
int print_sequence(void);

int pattern[LENGTH][BASES]; //identifies allowed bases each position
int fourmers[NUMERS];       //identifies prohibited 4mers
int sequence[LENGTH];       //base at specified position
int fourseq[LENGTH];        //fourmers ending at specified position
long longseq[LENGTH];       //longmers ending at specified position
int basecnt[BASES];         //number of occurrences of each base to left of
                            //current position
int spacings[LENGTH];       //position past (or =) to which an occurrence of
                            //a complementary fourmer to the one at this
                            //position is prohibited
FILE *outfile,*houtfile;
main()
{
    int i,j,k,cbs;      //counters
    long temp,c_b;      //temp storage of longmer
    int cseq[LENGTH] = {2,0,0,0,1,0,0,0,2,0,2,2,2,3,2}; //sample tag
    long xpas;          //number of passes thru loop over 4
    long maxtag;        //max number of different tags
    int current_base;   //current sequence position within backtracking
    int tag_count;      //count of acceptable tag sequences
    int pass;           //flag for whether tag was acceptable
    int exflag;         //flag indicating completion of all possible tags
    double cur_seq;     //integer representation of current tag sequence
    double prev_seq;    //integer representation of previously tested tag
    int histogram[LENGTH]; //histogram of mismatches to cseq
    int mismatches;

    numVectors = (LNUM/MXNUM);
    lngVectors = (int * *) farcalloc(numVectors, sizeof(int far *));
    if(lngVectors == NULL) {
        printf("\nerror allocating lngVectors!");
        exit(1);
    }
    for(i=0; i<numVectors; ++i) {
        lngVectors[i] = (int *) farcalloc(MXNUM, sizeof(int));
        if(lngVectors[i] == NULL) {
            printf("\nerror allocating vectors: i = %d\n",i);
            exit(1);
        }
    }
    //open output file
    outfile = fopen(out,"wt");
    if(NULL == outfile) {
        printf("\nerror opening output file %s\n",out);
        exit(1);
    }
    //open houtput file
    houtfile = fopen(hout,"wt");
    if(NULL == houtfile) {
        printf("\nerror opening output file %s\n",hout);
        fclose(outfile);exit(1);
    }
    //initialize histogram
    for(i=0; i<LENGTH; ++i) histogram[i] = 0;
    //initialize pattern array
    for(i=0; i<LENGTH; ++i) for(k=0; k<BASES; ++k) pattern[i][k] = 1;
    //initialize 4mer array
    for(i=0; i<NUMERS; ++i) fourmers[i] = 1000;
    //initialize 9mer array
    for(i=0; i<LMAT; ++i) lngVectors[i/MXNUM][i%MXNUM] = 0;
    //Runs marked
    fourmers[0] = fourmers[85] = fourmers[170] = fourmers[255] = -1;
    //initialize sequence
    for(i=0; i<LENGTH; ++i) sequence[i] = 0;
    //initialize spacings to disregard incomplete words at beginning
    for(i=0; i<3; ++i) spacings[i] = 1000;
    //initialize spacings to require 6mer palindrome
    for(i=3; i<LENGTH; ++i) spacings[i] = i+2;
    //initialize base sequence, current position, base count, tag count etc.
    //start at "largest" sequence we know is wrong
    sequence[0] = -1;
    current_base = 0;
    basecnt[0] = basecnt[1] = basecnt[2] = basecnt[3] = 0;
    tag_count = 0;
    prev_seq = -1;
    //initialize fourseq for fourmers up to current_base
    //don't need to since it is at zero and it will be reset
    //initialize fourmers for fourmers up to current_base
    //don't need to do anything because only one is at zero and
    //it will be removed shortly anyway
    //calculate maximum number of different tags
    for(maxtag = 1,i=0; i<LENGTH-PRETAG-1; ++i) {
        maxtag *= BASES;
    }
    printf("\nmaxtag = %ld\n",maxtag);
    //THIS IS A DEBUG STATEMENT!! REMOVE LATER
    maxtag = (long) 10000000000;
    pass = 0;

    //until all probes are exhausted
    for(j=0, xpas=0,exflag=0; exflag != 1 && xpas<maxtag; ++j) {
        /* if(j==4){ */
        ++xpas;
        j=0;
        /* } */
        //DEBUG
        //fprintf(outfile,"\nNOW: ");
        //print_sequence();
        //backtrack to last incrementable base
        for(; sequence[current_base] == BASES-1 || (pass == 1 && current_base > 7);
            --current_base) {
            //remove current base data
            remove_base_data(current_base);
            //set base to zero
            sequence[current_base] = 0;
        }
        if(current_base < PRETAG) {
            exflag = 1;
            continue;
        }
        //remove current base data
        remove_base_data(current_base);
        //increment current_base
        ++sequence[current_base];
        //error checking to ensure sequence is increasing
        //DEBUG
        //fprintf(outfile,"\nUPD: ");
        //print_sequence();
        /*
        cur_seq = seq_to_int();
        if(cur_seq <= prev_seq) {
            print_sequence();
            printf("\n\n!!ERRROR: current_base = %d, cur_seq = %d, prev_seq = %d j = %d\n",current_base,cur_seq,prev_seq,j);
            fclose(outfile);exit(1);
        }
        prev_seq = cur_seq;
        */
        //update base data and test until reach end or failure
        for(pass = 1; current_base < LENGTH && pass == 1; ++current_base)
            pass = test_base(current_base);
        //if testing was successful, print sequence
        --current_base;
        //extra check to be sure not repeating longmers
        if(pass == 1) {
            //record prohibitions on 9mers
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                if(lngVectors[c_b/MXNUM][c_b%MXNUM] != 0) {
                    printf("\n\n!!ERROR: already matching 9mer found: c_b = %ld, current_base = %d, longseq = %ld",
                        c_b, cbs, longseq[cbs]);
                    fclose(outfile);exit(1);
                }
            }
            //record longmer prohibitions
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                lngVectors[c_b/MXNUM][c_b%MXNUM] = 1;
            }
            //increment tag count
            ++tag_count;
            //print tag sequence
            fprintf(outfile,"\n");
            print_sequence();
            //calculate mismatches with ref sequence, cseq
            mismatches = 0;
            for(i=0; i<LENGTH; ++i) if(cseq[i] != sequence[i]) ++mismatches;
            ++histogram[mismatches];
            if(mismatches <= 2) {
                fprintf(houtfile,"\n");
                for(i=PRETAG; i<LENGTH; ++i)
                    fprintf(houtfile,"%c",label[sequence[i]]);
            }
            //fprintf(outfile," OK!");
        }
    }
    //if exited on j >=MAX record error and exit
    if(i>=maxtag)
        printf("\n!!ERROR: exceeded allowable passes through primary loop\n");
    //print tag count
    fprintf(outfile,"\n\nTag Count: %d\n",tag_count);
    for(i=0; i<LENGTH; ++i) fprintf(houtfile,"\n%d mismatches: %d",i,histogram[i]);
    fclose(outfile);
    fclose(houtfile);
    return(1);
}

int test_base(int current_base)
{
    int comp;   //complement
    long c_b;

    //increment base count (must be accomplished before exiting this function)
    ++basecnt[sequence[current_base]];
    //fprintf(outfile,"\nbase = %d ",current_base);
    //test base count, return upon failure
    if(basecnt[0] + basecnt[3] > ATLIM) {
        //fprintf(outfile, "failed AT limit");
        return(0);
    }
    if(basecnt[1] + basecnt[2] > GCLIM) {
        //fprintf(outfile,"failed GC limit");
        return(0);
    }
    if(basecnt[0] > AALIM) {
        //fprintf(outfile," failed A limit");
        return(0);
    }
    if(basecnt[1] > CCLIM) {
        //fprintf(outfile, "failed C limit");
        return(0);
    }
    if(basecnt[2] > GGLIM) {
        //fprintf(outfile, "failed G limit");
        return(0);
    }
    if(basecnt[3] > TTLIM) {
        //fprintf(outfile, "failed T limit");
        return(0);
    }
    /*
    //test if base matches patterns, return upon failure
    if(pattern[current_base][sequence[current_base]] != 1) {
        //fprintf(outfile," failed pattern match");
        return(0);
    }
    //if at last base verify checksum
    if(current_base == LENGTH-1)
        if(0 == ((basecnt[0]+basecnt[2])%2)) {
            //fprintf(outfile, " failed checksum");
            return(0);
        }
    */
    //compute current 4mer
    if(current_base == 0) fourseq[0] = sequence[current_base];
    else fourseq[current_base] = (4*fourseq[current_base-1])%256 + sequence[current_base];
    //if this is a full 4mer check 4mer and return upon failure
    if(current_base > 2) {
        comp = complement(fourseq[current_base]);
        if(fourmers[comp] <= current_base) {
            //fprintf(outfile,"fourmer failed:curt: %d, comp: %d, spacing: %d",
            // fourseq[current_base],comp,fourmers[comp]);
            return(0);
        }
    }
    //compute current 9mer if full 9mer
    if(current_base == 0) longseq[0] = sequence[current_base];
    else longseq[current_base] = (4*longseq[current_base-1])%LNUM + sequence[current_base];
    //if full longmer check 9mer and return upon failure
    if(current_base > LMAT-2) {
        c_b = longseq[current_base];
        if(lngVectors[c_b/MXNUM][c_b%MXNUM] == 1) {
            //printf("hello");
            //fprintf(outfile, "longmer failed:c_b: %ld, lngVectors[c_b]: %d",c_b,lngVectors[c_b]);
            return(0);
        }
    }
    //record prohibitions on 4mer
    if(current_base > 2 && fourmers[fourseq[current_base]] > spacings[current_base])
        fourmers[fourseq[current_base]] = spacings[current_base];
    //all tests passed!! Add new base:
    //fprintf(outfile, " passed! ");
    //passed all tests
    return(1);
}

int remove_base_data(int current_base)
{
    int i;
    int curt;   //current fourmer

    if(sequence[current_base] < 0) return(1);
    //calculate complement
    curt = fourseq[current_base];
    //remove current 4mer prohibitions
    if(fourmers[curt] > -1) fourmers[curt] = 1000;
    //reinstate prohibitions for previous copies of this 4mer
    for(i=0; i<current_base; ++i) {
        if(fourseq[i] == curt) {
            if(fourmers[curt] > spacings[i]) fourmers[curt] = spacings[i];
        }
    }
    //adjust base count
    --basecnt[sequence[current_base]];
    return(1);
}

int complement(int fourmer)
{
    int i;
    int comp;

    //assuming fourmers in four bases
    for(i=0, comp=0; i<4; ++i) {
        comp *= 4;
        comp += 3-(fourmer%4);  //add complement of base
        fourmer /= 4;
    }
    return(comp);
}

double seq_to_int()
{
    int i;
    double seq_int = 0;

    //assumes 4 bases
    for(i=PRETAG; i<LENGTH; ++i) {
        seq_int *= 4;
        seq_int += sequence[i];
        //printf("\n%ld",seq_int);
    }
    return(seq_int);
}

int print_sequence()
{
    int i;

    for(i=PRETAG; i<LENGTH; ++i) fprintf(outfile,"%c",label[sequence[i]]);
    return(1);
}
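
The base-composition limits applied by `test_base()` can also be checked on a finished tag string. The following portable C sketch does so with the limit values taken from the `#define` lines of the listing. The function name `composition_ok` is hypothetical, and the sketch tests only composition, not the 4-mer or long-run rules.

```c
#include <assert.h>
#include <stddef.h>

/* Limits copied from the #define lines of the listing. */
enum { GCLIM = 11, ATLIM = 11, AALIM = 7, CCLIM = 6, GGLIM = 6, TTLIM = 7 };

/* Returns 1 if the tag satisfies every composition limit, 0 otherwise
   (including when it contains a character other than A, C, G or T). */
int composition_ok(const char *tag) {
    int a = 0, c = 0, g = 0, t = 0;
    assert(tag != NULL);
    for (size_t i = 0; tag[i] != '\0'; ++i) {
        switch (tag[i]) {
        case 'A': ++a; break;
        case 'C': ++c; break;
        case 'G': ++g; break;
        case 'T': ++t; break;
        default: return 0;   /* not a nucleotide letter */
        }
    }
    return a + t <= ATLIM && c + g <= GCLIM &&
           a <= AALIM && c <= CCLIM && g <= GGLIM && t <= TTLIM;
}
```

The listing applies the same limits one base at a time so that a failing prefix is abandoned early; because the counts only grow, checking the completed tag gives the same accept or reject decision for composition.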






A truncated output list is provided below: 



/** SAMPLE OUTPUT FILE FOR TAGS895.CPP **/
AAACAAACACCCGCGTGGTT
AAACAAAGACCCGCCGGTGT
AAACAAATACCCGCCGTGGG
AAACAACAACCCGCGTGTGG
AAACAACCAACCCGGTGTGG
AAACAACGAACCCGCTGGTG
AAACAACTAACCCGCTGTGG
AAACAAGAACCCGCCGTTGG
AAACAAGCAACCCGGCGTGT
AAACAAGGAACCCGCCTGGT
AAACAAGTAACCCGCCTTTG
AAACAATAACCCGCGCTGGG
AAACAATCAACCCGCTTGGG
AAACAATGAACCCGCGTCGG
AAACAATTAACCCGCGTTCG
AAACACAAACCCGGCTGGTG
AAACACACAACCCGTGGTGG
AAACACAGAACCCGCTTTGG
AAACACATAACCCGGCGGTG
AAACACCAAACCCGTTGTGG
AAACACCCAAACCGTTGTGG
AAACACCGAAACCCTGTGGG
AAACACCTAAACCCTTGTGG
AAACACGAAACCCGGTCGGT
AAACACGCAAACCCGGTGGT
AAACACGGAAACCCTCGGTG
AAACACGTAAACCCGTCGGT
AAACACTAAACCCGTGCGGT
AAACACTCAAACCCTGGTGG
AAACACTGAAACCCGTCTGG
AAACACTTAAACCCGTTCGG
AAACAGAAACCCGCTCGGTG
AAACAGACAACCCGGCTTGG
AAACAGAGAACCCGGCCTTG
AAACAGATAACCCGCTCTTG
AAACAGCAAACCCGTGGCGT
AAACAGCCAAACCGCGTGGT

//many lines removed here //

Tag Count: 14507 
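
The rule that no two accepted tags share a long run can be re-checked on an output list such as the one above. The following portable C sketch uses a lookup table indexed by the base-4 value of each run, in the manner of the `lngVectors` table in the listing. The names `K`, `TABLE_SIZE`, `code_of` and `accept_tag` are hypothetical, and the run length of 10 is an assumption taken from `LMAT` and `LNUM` in the listing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define K 10                          /* run length checked (assumed from LMAT) */
#define TABLE_SIZE (1UL << (2 * K))   /* 4^K possible runs of length K */

/* Numeric code of a base (A=0, C=1, G=2, T=3); -1 if not a base. */
static int code_of(char b) {
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Returns 1 and records the tag's K-mers if none of them was recorded
   before; returns 0 and leaves the table unchanged otherwise.
   `seen` must point to TABLE_SIZE zero-initialized bytes. */
int accept_tag(unsigned char *seen, const char *tag) {
    size_t len = strlen(tag);
    unsigned long h = 0;
    size_t i;
    assert(seen != NULL);
    /* First pass: reject on an invalid letter or an already recorded K-mer. */
    for (i = 0; i < len; ++i) {
        int c = code_of(tag[i]);
        if (c < 0) return 0;
        h = (4 * h + (unsigned long)c) % TABLE_SIZE;   /* rolling base-4 value */
        if (i + 1 >= K && seen[h]) return 0;
    }
    /* Second pass: record every K-mer of the accepted tag. */
    h = 0;
    for (i = 0; i < len; ++i) {
        h = (4 * h + (unsigned long)code_of(tag[i])) % TABLE_SIZE;
        if (i + 1 >= K) seen[h] = 1;
    }
    return 1;
}
```

A caller would allocate the table once, for example with `calloc(TABLE_SIZE, 1)`, and pass each tag of the list in turn; a return value of 0 marks a tag that repeats a run already present in an earlier tag.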

All publications and patent applications cited in this specification are herein 
incorporated by reference for all purposes as if each individual publication or patent 
application were specifically and individually indicated to be incorporated by reference. 

Although the foregoing invention has been described in some detail by way 
of illustration and example for purposes of clarity of understanding, it will be readily 
apparent to those of ordinary skill in the art in light of the teachings of this invention that 
certain changes and modifications may be made thereto without departing from the spirit 
or scope of the appended claims. 



